$ Esp Spss Tutorial
ly/esp-168
Statistical Analysis
with SPSS
Edition 2024
Telegram: 012 28 93 63
Acknowledgements
I would like to thank the Subject Centre for Information and Computer Sciences (Higher Education Academy)
for funding the development of this SPSS Guide and supporting tutorials.
I would also like to thank David Green, Mathematics Education Centre, Loughborough University, for the
many hours of labour, much of it unpaid, he spent on the project. Without his support, dedication and
enthusiasm the outcome would have been very different.
In addition, I would like to acknowledge the use of the datasets listed below.
Full details of sources are found on the last pages of the Appendix to this Guide.
Population, Gross National Income (GNI) and Gross Domestic Product (GDP) for Countries
The World Bank
The International Monetary Fund
Facebook Users
Internet World Stats
Contents
PART 1 – REFERENCE SECTIONS
Section
1 Introduction
2.3.1 Cells
2.3.2 Cases
2.3.3 Variables
2.4.1 Variables
2.4.4 Values
3.1 Toolbar
3.2.1 Topics
3.2.2 Tutorial
4.1 Variables
PART 2 – TUTORIALS
TUTORIAL T1: Starting the SPSS program
T6.2 One-variable Frequency Table for nominal data – Count and Total
T6.3 One-variable Frequency Table for ordinal data – Count with Subtotals
T6.4 One-variable Frequency Table for nominal data – Count with order sorted
T7.1 Two-variable Two-way Frequency Table for scale and nominal data – Count, Max, Min, Median
T7.2 Two-variable Nested Frequency Table for nominal data – Count & Col%
T7.3 Two-variable Two-way Frequency Table for nominal data – Count & Col%
T7.4 Two-variable Two-way Frequency Table for nominal and ordinal data – interchanging rows and columns – Row%
T7.5 Two-variable Nested and Stacked Frequency Tables for nominal and scale data – Count
T8.1 Three-variable Frequency Table for two nominal variables and one scale variable – Mean
T8.3 Four-variable Frequency Table for three nominal variables and one scale variable – Median and Mode
Guide to SPSS for Information Science vii
T26.2 The One-sample Chi-square Test – expected category values equal
T26.3 The One-sample Chi-square Test – expected category values entered
T28.2 The Wilcoxon Matched-pairs Signed-ranks Test – for paired samples
APPENDIX
This Guide was originally written for PASW 18.0 and has been adapted for IBM SPSS Statistics
19.0. Some PASW screenshots have been retained.
The purpose of this Guide is to enable the reader to perform many of the fundamental operations
of SPSS. The intention is that the reader will then be able to investigate further capabilities of
SPSS, as required. Not every available facility is detailed, or even mentioned, here. SPSS itself
has an extensive built-in TUTORIAL system to aid further exploration.
In order to use this Guide you will need to understand how to carry out basic Windows operations
using a mouse, to open menus and make menu selections, and re-size windows.
You will also need to understand some basic Windows terminology such as 'menu bars',
'toolbars', 'panes', 'windows' and 'drop down menus'. It would be helpful to know the basic
concepts of a spreadsheet (although SPSS operates somewhat differently).
Main windows, top level menu names and options are shown in Arial Black ... like this
All titles, options, buttons, variable names and labels are shown in Arial bold ... like this
Variable values and codes for them are shown within single quotes in Arial font ... 'like this'
Main text in this Guide is printed in Arial font ... like this
Data Editor. This enables you to insert, view or amend data, and to create or edit data
files. It has two formats: Data View and Variable View.
Viewer. This displays all statistical results, tables and charts, which can be edited and
saved for later use. It opens automatically when you first ask the system to generate
output.
Chart Editor. This editor enables you to modify charts and plots. It is activated by
double-clicking on a previously created chart.
Text Output Editor. This enables you to edit text that is not displayed in pivot tables.
Pivot Table Editor. This enables you to edit pivot tables, such as transposing rows and
columns and showing/hiding parts of tables. [This is beyond the scope of this Guide. See:
Online Help > Contents > Pivot Tables > Manipulating a Pivot Table.]
Syntax Editor. This advanced feature enables you to create and edit command syntax.
[This is beyond the scope of this Guide.]
There are some distinctions between the way spreadsheets operate and the way SPSS data is
organised and displayed in the Data Editor window.
The Data Editor window displays the content of the active SPSS file in either of two formats:
Data View and Variable View. The window would typically have the title:
signifying that the source of the SPSS data displayed is a file with extension sav called
Filename.
For reference, an annotated Data Editor window (in Data View mode) is shown below. All
Menu commands and Toolbar icons are set out exactly the same in both View modes.
In Data View the data is displayed in a spreadsheet format of rows (representing cases e.g. the
100 best-selling books in GB) and columns (representing variables e.g. position, title, author,
imprint, publisher, volume_of_sales, value, etc.).
In Variable View the data is displayed quite differently – each row represents one of the
variables and each column contains information about an attribute of that variable or how it is to
be displayed on screen (e.g. variable name, type (numeric or character string usually), width
allowed for data entry, number of decimal places etc.).
2.3.1 Cells
In Data View data is displayed in cells, each
item of data in a cell being known as a value.
Each cell contains a single value of one variable
for a particular case. Unlike a spreadsheet, cells
in SPSS cannot contain formulas.
2.3.2 Cases
In Data View rows are cases. Each row
represents a different case. A case is a set
of observations about one person, one
country, one object, one experiment, etc.
As in a spreadsheet, SPSS numbers each row but this is not tied to, or part of, the case. Often
a unique ID number is provided for each case which is tied to the case (being a variable), as in
the example here.
2.3.3 Variables
In Data View columns are variables. Each column represents a different variable. A
variable is a measure of a characteristic or outcome that is being observed, measured or
generated. A variable can take different values. For example, the response to each item on a
multiple choice questionnaire would be a separate variable (which could take different values).
A name must be provided for each variable (e.g. book_title).
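The layout described in Sections 2.3.1 to 2.3.3 can be pictured outside SPSS as well. The following is only a rough Python sketch, not anything SPSS does internally: each dictionary stands for one case (row) and each key for one variable (column). The titles and sales figures are hypothetical.

```python
# Hypothetical data: each dict is one case (row); each key is one variable (column).
cases = [
    {"id": 1, "book_title": "Title A", "volume_of_sales": 1200},
    {"id": 2, "book_title": "Title B", "volume_of_sales": 950},
    {"id": 3, "book_title": "Title C", "volume_of_sales": 400},
]

def column(data, variable):
    """Return every value of one variable, i.e. one Data View column."""
    return [case[variable] for case in data]

def cell(data, row_index, variable):
    """Return the single value held in one cell (one case, one variable)."""
    return data[row_index][variable]

sales = column(cases, "volume_of_sales")
first_title = cell(cases, 0, "book_title")
```

Note that the id variable travels with its case, whereas the list position (like the SPSS row number) is not part of the case.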
2.4.1 Variables
Variables are usually created and their
attributes defined in Variable View.
E.g. if the variable name was age then the label might be
'age of the respondent in years on 1 January 2012'; if the
variable name was bkdate then the label might be 'date of
publication of the first edition'.
These values are just 'names' or 'categories' with no specific way to order or measure them. The
corresponding variables are classified as string variables as they are just character strings. String
variables can use digits as well as alphabetic characters (e.g. case numbers, bank account
numbers).
Some statistics books call this type of data categorical, and SPSS uses the term category for
the axis of a chart of a nominal variable.
These values are more than just 'names' or 'categories' because there is an obvious way to order
them. However, they cannot really be measured mathematically.
In Ex 3 and Ex 4 you can meaningfully say 'infant' is less than 'child' and that '0–5' is less than
'6–12' but you cannot say that one is (say) 3 more than the other, or one is half the other.
Even when there is a single number (a numerical code) to signify each value (as in Ex 5) it does
not mean the rules of arithmetic apply. It is true that '4' signifies being friendlier than someone
who is classified as '2' but that does not mean twice as friendly! Furthermore, it is not meaningful
to say the difference in friendliness between '5' and '3' is the same as the difference between '3'
and '1'.
N.B. There is a complication with the typical five point scale such as in Ex 5 if there is a further
code or codes (e.g. '6' for 'Don't know'). Unless this extra code is treated as a missing value (see
Section 2.4.6) and excluded from most statistical analyses you cannot really claim to have
ordinal data, and it should be considered nominal.
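To see why the extra code matters, here is a minimal Python sketch (an illustration only, with hypothetical responses) that compares the median of a five point scale with and without a 'Don't know' code of 6. The function name and the data are assumptions made for this example.

```python
from statistics import median

# Hypothetical responses on a five-point friendliness scale,
# where 6 is an assumed extra code meaning "Don't know".
responses = [4, 2, 6, 5, 3, 6, 1, 4]
DONT_KNOW = 6

def ordinal_values(values, missing_codes):
    """Drop codes declared as missing so only the ordered codes remain."""
    return [v for v in values if v not in missing_codes]

valid = ordinal_values(responses, {DONT_KNOW})
median_with_code = median(responses)    # distorted by the 'Don't know' code
median_without_code = median(valid)     # median of the genuine ordinal codes
```

With the 'Don't know' responses left in, the code 6 is treated as if it meant 'friendlier than 5', which pulls the median upwards.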
Here the difference between '95' and '96' is the same as the difference between '100' and '101'.
However, it is not meaningful to say that what is measured as '100' is twice '50'. This is because
there is a false origin (zero position) and temperatures can be below zero.
In all these cases the rules of arithmetic do apply – '100' is twice '50', and the difference
between '95' and '96' is the same as the difference between '100' and '101'.
SPSS calls this type of data scale too. So it does not differentiate between Interval and Ratio.
This is because statistical procedures which apply to one of these will also apply to the other.
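The distinction between interval and ratio data can be checked with a little arithmetic. The sketch below assumes, purely for illustration, that the temperatures are in degrees Celsius and converts them to Kelvin, a scale with a true zero.

```python
def celsius_to_kelvin(c):
    """Convert to Kelvin, a scale with a true zero."""
    return c + 273.15

# Differences are meaningful on an interval scale:
diff_low = 96 - 95
diff_high = 101 - 100

# Ratios are not: 100 degrees C is not "twice as hot" as 50 degrees C.
naive_ratio = 100 / 50
true_ratio = celsius_to_kelvin(100) / celsius_to_kelvin(50)
```

The two differences are equal, but the ratio computed on the true-zero scale is roughly 1.15 rather than 2.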
2.4.4 Values
Values for variables can be the actual data, e.g. if the variable is age then the values could be the
actual ages of respondents in complete years. The values could instead be codes representative
of the actual data. For example, for the variable gender the values could be '1' representing
'male' and '2' representing 'female'. Alternatively the full words could be entered, or abbreviations
used such as 'M' and 'F'. It is normal to use numerical codes rather than strings as they are more
amenable to statistical procedures.
For string variables, a blank or series of blanks is considered a valid value and so is not
interpreted as signifying missing data unless explicitly declared as such. There is, therefore, no
system missing value for string variables.
Respondents to a questionnaire may answer a question but not provide a valid measure for the
variable (e.g. claiming an age of 200). When this occurs the value could be entered but then
explicitly declared a missing value or, more likely, an invalid-response code could be entered
instead. For example one might use the number '999' to represent an invalid response for the
variable age. The number '999' would be a user-defined missing value. A different code could
be used to signify a no-response.
The number used for a missing value must, of course, be one that could not possibly occur for
the variable in question. E.g. the code for signifying no-response to number_of_children could
not be '0' or '9' but it could be '99'.
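The effect of declaring a user-defined missing value can be sketched in Python as follows. This is an illustration only: the ages are hypothetical and the function name is an assumption, not an SPSS facility.

```python
from statistics import mean

# Hypothetical ages; 999 is the assumed user-defined missing value.
ages = [23, 35, 999, 41, 999, 29]
MISSING_AGE = 999

def mean_excluding_missing(values, missing_codes):
    """Average only the values that are not declared missing."""
    valid = [v for v in values if v not in missing_codes]
    return mean(valid)

mean_all = mean(ages)                                     # inflated by the 999 codes
mean_valid = mean_excluding_missing(ages, {MISSING_AGE})  # uses the four real ages
```

If the code were not declared missing, the two 999 entries would be averaged in as if they were genuine ages.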
Alphabetical order is certainly better than file order if there are a lot of variables to search
through. Names or labels will depend on whether the names are clear and distinct enough or
not. Below are some screens showing what one gets. The example is for the Frequencies
procedure which will be introduced in TUTORIAL T3.
The right screen below has Display labels in Alphabetical order (note that the name is
included in square brackets after the label).
NOTE: You can right-click on a variable list and choose the display format you want without
needing to use Edit > Options.
To change the d.p. level use Edit > Options and select Data. Then click the up/down arrows
to achieve whatever d.p. level you want for all future new variables you create. Don't forget to
click Apply before finishing with OK.
To do so use Edit > Options and select Currency. Then click on CCA (or another) and
enter the required prefix (e.g. £) and choose period (i.e. full-stop) as the decimal separator.
Don't forget to click Apply before clicking OK.
To make changes use Edit > Options and select Output Labels. Then for Pivot Table
Labelling use the drop-down menus to select what you want for Variables in labels shown
as and for Variable values in labels shown as. Then click Apply and finish with OK.
• Moving an item in the outline pane moves the corresponding item in the contents pane.
• Selecting an object makes it available for editing, printing or exporting (e.g. to a Microsoft Word
document).
• Double-clicking an object makes it amenable to editing: text and numbers can be changed, and
the width of columns varied by dragging with the cursor.
All visible exports all the output apart from the hidden commands (N.B. it may not
actually be showing on screen, depending on what you have selected)
The selection can be just one object whose selection is shown by a short red pointer in the outline
pane and in the contents pane (though it may be hidden unless you widen the window) or a whole
group selected in the outline pane.
Each object shown in the outline pane as an open book icon will print. Each object shown as a
closed book icon will not print.
If any open data files or output files have not been saved, the user is alerted.
3.1 Toolbar
Move the arrow cursor across the toolbar and position
it over any one of the icons.
Topics
Tutorial
Case Studies
Statistics Coach
3.2.1 Topics
The Contents list appears from which a topic can be selected (see example below). Also
available are an Index and a Search facility, into which one can type a keyword.
3.2.2 Tutorial
This presents a Table of Contents which one can browse to find illustrated, step-by-step
instructions on many basic SPSS features.
Some of the tutorials use demo data files (see example below).
Statistics Base
Advanced Statistics Option
A tutorial appears asking the user 'What do you want to do?' and presents information to help the
user to select the appropriate presentation method or statistical test to employ. (See below.)
If a Help button can be seen in the current window in which you are working, then
you can click on it to obtain context-sensitive help.
A window titled Online Help will appear containing text specific to the task in hand, i.e. it takes
you straight to the relevant Help page to save you having to look for it.
A column in the Data Editor window of SPSS stores all the values for one variable.
It is advisable, though not necessary, to create the variables in the data file before the data values
are keyed in. If the column of data is entered first then SPSS creates a dummy name (VAR00001,
and so on) which can be edited subsequently.
1. Click on the tab Variable View at the bottom of the Data Editor window.
2. Type the first variable name into the first cell of the column headed Name and press the Enter key.
• Maximum of 64 characters – upper and lower case letters, digits and some symbols
• Name must begin with a letter or @ symbol
• No spaces allowed (underscore is often used in its place)
• Punctuation marks are allowed (e.g. full stop)
• Cannot be one of the few reserved keywords (e.g. NOT)
• Default attributes of the variable appear in columns to the right of Name. These are:
Type – Numeric
Width – 8
Decimals – 2 d.p.
Label – blank
Values – None
Missing – None
Columns – 8
Align – Right
Measure – Unknown
Role – Input
It is best to have a short variable name that is easily recognised as related to the underlying
variable or to the source (e.g. question number).
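The naming rules listed in step 2 can be collected into a small checking function. The Python sketch below is not part of SPSS. The list of reserved keywords is the standard set for SPSS command syntax (the Guide itself cites only NOT), and the set of permitted symbols (underscore, full stop, @, # and $) is an assumption made for this example.

```python
import re

# Reserved keywords in SPSS command syntax (the Guide itself cites only NOT).
RESERVED = {"ALL", "AND", "BY", "EQ", "GE", "GT", "LE", "LT",
            "NE", "NOT", "OR", "TO", "WITH"}

# Assumption: the permitted symbols are underscore, full stop, @, # and $.
NAME_PATTERN = re.compile(r"[A-Za-z@][A-Za-z0-9_.@#$]*")

def is_valid_name(name):
    """Check a proposed variable name against the rules listed above."""
    if not 1 <= len(name) <= 64:          # maximum of 64 characters
        return False
    if name.upper() in RESERVED:          # reserved keywords are not allowed
        return False
    # Must begin with a letter or @ and contain no spaces.
    return NAME_PATTERN.fullmatch(name) is not None
```

For example, is_valid_name("book_title") is accepted, whereas "book title" (contains a space) and "1st_item" (begins with a digit) are rejected.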
3. Type
To change the variable type, click the appropriate cell in the column titled Type (which will
normally contain the default Numeric), click on the three dots shaded in grey to the right of
Numeric and click again to open the Variable Type dialog box.
4. Width
For a string variable, the width determines the maximum number of characters allowed in
the string. For example if a string has width set to 3 then 'CAT' can be entered but 'MOUSE'
cannot. It can be useful when entering string data to prevent certain input mistakes. A string
can have a maximum of 32767 characters! Any string shorter than the width is 'padded' on the
right with blanks.
For all other variables (numeric), the width specifies the expected maximum width for the
number (not the maximum number of digits allowed). A numeric can have a maximum of 40
digits (maximum 16 decimal places).
To change the width of a variable click the appropriate cell in the column titled Width, and
then type in the value you want or use the up and down scroller. (Alternatively, you could
edit the Width within the Variable Type box discussed in 3 above.)
5. Decimals
To change the number of decimal places of a numeric variable displayed, click the
appropriate cell in the column titled Decimals, and type in the desired number or use the up
and down scroller. The maximum is 16. This does not affect the actual number of decimals
in the variable. (Alternatively, you could edit the Decimal Places within the Variable Type
box discussed in 3 above.)
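The point that Decimals changes only what is displayed, not what is stored, can be illustrated with a short Python sketch. The function name and the value are hypothetical; this mimics the idea rather than SPSS's own formatting.

```python
def displayed(value, decimals):
    """Format a number to a given number of decimal places for display only."""
    return f"{value:.{decimals}f}"

stored = 3.14159                  # the value held in the variable
shown_2dp = displayed(stored, 2)  # what would be seen with Decimals set to 2
shown_0dp = displayed(stored, 0)  # what would be seen with Decimals set to 0
```

Whatever is shown, the stored value is still 3.14159 and it is this value that would be used in calculations.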
6. Label
To enter a variable label, move the cursor to the Label column and click the appropriate cell,
and then type in the label of your choice.
• The variable label is an explanation of what the variable is, e.g. if the variable
name was sex then the label might be 'gender of the respondent'.
7. Values
To enter values and value labels, move the cursor to
the Values column and then click it. Click on the
three highlighted dots just to the right of None to open
the Value Labels dialog box.
By way of example, suppose a question can be answered 'Yes', 'No', or 'Don't know' which
are coded '1', '2', '3' respectively. To enter values and value labels proceed as follows:
First, type in the first value that can represent the variable – in this case '1'.
Second, click on the Value Label: field of the Value Labels window and type in what the
value represents – in this case 'Yes'.
• The value and its meaning now appear in a box next to the Add button.
Repeat the process to insert the 'No' and 'Don't know' values.
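The relationship between values and value labels amounts to a lookup table. A minimal Python sketch, using the codes from the example above and some hypothetical coded responses, might look like this:

```python
# Codes and labels from the example above.
value_labels = {1: "Yes", 2: "No", 3: "Don't know"}

def label_for(code, labels):
    """Return the label for a code, or the code itself if it has no label."""
    return labels.get(code, str(code))

answers = [1, 3, 2, 1]            # hypothetical coded responses
labelled = [label_for(a, value_labels) for a in answers]
```

The data file stores only the codes; the labels are attached when output is displayed.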
8. Missing
To define missing values, move the cursor to the Missing column and then click the
appropriate cell. Click on the three highlighted dots to the right of None to open the Missing
Values dialog box.
The purpose of defining missing values is to prevent SPSS including them when doing
calculations (e.g. finding the mean of a set of numbers).
• E.g. '9' to signify no response to an MCQ which has allowable choices coded '1' to '5'.
9. Columns
The width of a column displayed in the Data Editor in Data View mode can be set
using this. It does not affect the underlying variable, only what is actually displayed. In
practice, this is little used as the widths of columns in Data View can be over-ridden by
dragging. To change the declared width, in Variable View type in the desired number or
use the up and down scroller to vary the current entry.
10. Align
The default alignment of numeric data in the Data Editor in Data View is the right
margin, and the default alignment for string data is the left margin. Sometimes it is
preferable to centre data or even align the data to the opposite margin. This column enables
this to be done by selecting the variable's cell in the Align column and clicking on the
downward arrow to select one of the alternatives Left, Right or Center.
11. Measure
Initially, Measure is set to 'Unknown'.
For all Types except 'String' you need to click on Measure and choose from 'Nominal',
'Ordinal', 'Scale'.
If you set Type as 'String' then Measure is automatically set to the default 'Nominal'. String
variables can be designated 'Nominal' or 'Ordinal'.
12. Role
This is a new advanced feature, beyond this Guide's scope. (Default is: Input.)
1. If you are opening a data file on a USB memory device, insert it.
► A horizontal scroll bar will appear below if there are too many entries to fit in the window.
5. Locate and select the folder or file required. Double-click a folder name to reveal the list of
files (which are in .sav format) and click on the required file.
► The selected file will now be loaded into a Data Editor window in Data View mode.
Data stored in the cells displayed in the Data Editor window can be edited in several ways.
1. To change an entry, click on the cell which will highlight the data, and key in the new data
and press the Enter key.
2. To delete an entry, click on the cell and press the Backspace key or use Edit > Clear.
1. Click and hold down the mouse button on the first (top left) cell that is to be copied or moved.
2. Drag the mouse pointer to the last (bottom right) cell to highlight the block of cells that are to
be copied or moved, and release the mouse button.
Alternatively to 1 and 2, click the top-left cell and then shift-click the bottom-right cell.
4. If the cells are to be copied click on Copy on the drop down menu, and then go to step 6.
5. If the highlighted cells are to be moved click on Cut on the drop down menu.
6. Click on the first destination cell to which the copied cells are to be moved or copied.
1. Ensure Data View is selected (by clicking on the tab at the bottom of the Data Editor
window if necessary).
2. Then click on the case number immediately below where you want the new case inserted,
which selects that case (row).
The new case will be inserted immediately above the selected row.
1. Ensure the Data View is selected (by clicking on the tab at the bottom of the Data
Editor window if necessary).
2. Click on the number of the case to be deleted (this action selects the case).
1. If Variable View is selected, click on the number of an existing variable (or blank row)
where the new variable is to be inserted (this selects the row for that variable).
If Data View is selected, click on the name of an existing variable (or blank column) where
the new variable is to be inserted (this selects the column for that variable).
• The new variable will be inserted (it will be named VAR00001 or similar, which you can edit).
It can be useful to have a variable with data identical to another. The data can be placed in an
existing variable (whether empty or not) or in a new position, creating a new variable.
2. Click on the variable name at the top of the column to be copied. This highlights the column.
1. If Data View is selected, click on the variable name at the top of the column.
If Variable View is selected, click on the variable number at the left of the row.
If data is not saved regularly there is the danger of losing a great deal of work.
It is wise to save the data into a new file each time rather than overwriting the existing file, using a
progressive numbering system.
3. If the data is to be stored overwriting the current file (not normally advised!):
Click on File on the menu bar and Save on the drop down menu.
Click on File on the menu bar, and Save As on the drop down menu.
5. Click on the down arrow next to the Save In box and locate and select the USB drive, or
your Central File Store, or wherever else you wish to store the data.
• Note: the dot symbol should be used in a name with caution. If the dot is followed by
exactly three letters SPSS will think it is an extension name (determining the file type)
and will not save the file as a .sav file in which case it will not be recognised by SPSS
when you try to locate it to load in again.
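A quick way to spot a risky file name before saving is to test whether it ends in a dot followed by exactly three letters. The Python sketch below only illustrates the rule as stated in the note above; the function name and the example names are hypothetical.

```python
import re

def looks_like_extension(filename):
    """True if the name ends with a dot followed by exactly three letters."""
    return re.search(r"\.[A-Za-z]{3}$", filename) is not None

risky = looks_like_extension("survey.jan")      # '.jan' could be taken as a file type
safe = looks_like_extension("survey_jan")       # no dot, so no confusion
also_safe = looks_like_extension("survey.v2")   # dot followed by a letter and a digit
```

Using an underscore in place of the dot avoids the problem altogether.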
Data can be imported from a wide range of spreadsheets, databases and text files. The most
common is a Microsoft Excel spreadsheet, which is the subject of this section. (For other valid
sources and how to import them see the Online Tutorial 'Reading Data'.)
It can be easier to first enter one's raw data in Excel and only when satisfied with it to transfer it to
SPSS. Sometimes secondary data will already be available in Excel, which can be copied straight
across to SPSS.
It is important that the Excel data is set out with rows for the cases and columns for the variables.
Locating the required Excel data file can be a challenge, as the following will indicate.
Selecting File > Open > Data will produce an Open Data window something like this:
The Look in entry specifies whereabouts (in computer memory, USB sticks, etc.) data files and
folders are being looked for.
This can be changed by clicking on the down arrow and navigating through memory.
This is done by opening the Files of type drop-down box and selecting the format required, like this:
There is an option to display all types of files – All Files (*.*) – which can be useful sometimes. (It
is hidden at the bottom of the list of formats, so scroll down to reveal it.)
An important matter is whether the first Excel row contains headings (which can become the variable
names) or whether there are no headings and the first row represents data for the first case.
SPSS assumes the first row contains headings (indicated by a tick in the 'Read variable names from
the first row of data' tick box – see above). It is essential to indicate if this is not true by removing
the tick, otherwise the first case will be lost.
If no Excel headings are indicated, then SPSS invents variable names (VAR00001, VAR00002 and
so on, which can be edited).
Excel headings, if present, may not conform to SPSS requirements (e.g. spaces are not allowed
in SPSS variable names; names must begin with a letter or @ symbol). In such cases SPSS
amends the variable names to conform.
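The two situations (headings present or absent) and the tidying of headings can be sketched in Python. This is an illustration only: the Guide does not describe exactly how SPSS amends names, so the rules in conform below are assumptions, as are the function names, the sample sheet and the numbering width of the invented names.

```python
import re

def conform(heading):
    """Make a heading resemble a legal variable name (illustrative rules only)."""
    name = re.sub(r"\s+", "_", heading.strip())       # spaces are not allowed
    name = re.sub(r"[^A-Za-z0-9_.@#$]", "", name)     # drop other characters
    if not re.match(r"[A-Za-z@]", name):              # must start with a letter or @
        name = "V" + name
    return name[:64]

def split_rows(rows, first_row_has_names=True):
    """Return (variable names, data rows) for a sheet given as a list of rows."""
    if first_row_has_names:
        return [conform(h) for h in rows[0]], rows[1:]
    # No headings: invent names of the VAR00001 kind and keep every row as data.
    names = [f"VAR{i:05d}" for i in range(1, len(rows[0]) + 1)]
    return names, rows

# Hypothetical sheet with a heading row followed by two cases.
sheet = [["Book Title", "2008 Sales"], ["Title A", 1200], ["Title B", 950]]
names, data = split_rows(sheet)
```

Calling split_rows(sheet, first_row_has_names=False) instead keeps all three rows as data, which shows why leaving the tick in place when there are no headings would lose the first case.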
The SPSS default is to import all of the spreadsheet but the user can import just part of the
spreadsheet by specifying a reduced range (e.g. A1:G50). In fact, it is a good idea to specify
the correct range, even when you want the whole spreadsheet, as SPSS has a habit of loading
in superfluous cases and variables and filling them with the system missing value symbol (a
full stop). (This can happen, for example, if there is a spurious entry hidden away outside the
correct range.) These phantom cases and variables would have to be deleted before analysing
the data.
SPSS data files can be exported into a wide range of file formats. The most commonly used is
Microsoft Excel, which is the subject of this Section. (The procedure is very similar for other
formats.)
Having created an SPSS data file (e.g. Test_Data.sav) one proceeds as follows:
1. File > Save As
It could be interesting to see if that or some other book earned the most money. There is a
variable for that – Value – so we can sort on it, as follows:
Data > Sort Cases produces the Sort Cases window below left. Selecting Value, moving
it using the blue arrow into the Sort by box and clicking on the radio button Descending produces
the window below right. Clicking OK then initiates the sorting. The '(D)' after the variable name
indicates that descending order sorting has been chosen.
This shows that The Da Vinci Code is relegated to fourth place and Harry Potter books occupy
the first three places. Note that when SPSS sorts cases it always moves the whole row, i.e. it
preserves the case intact. (This does not happen with Microsoft Excel unless the whole
spreadsheet is selected.)
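The idea that a sort moves whole cases can be shown with a short Python sketch. The titles and figures are hypothetical, and the function is only an analogy for Sort Cases with the Descending option, not SPSS's own mechanism.

```python
# Hypothetical cases; Value is the total money earned by each title.
books = [
    {"Title": "Title A", "Volume": 500, "Value": 2500.0},
    {"Title": "Title B", "Volume": 300, "Value": 4100.0},
    {"Title": "Title C", "Volume": 450, "Value": 3300.0},
]

def sort_cases(cases, variable, descending=False):
    """Return the cases reordered by one variable; each row moves as a unit."""
    return sorted(cases, key=lambda case: case[variable], reverse=descending)

by_value = sort_cases(books, "Value", descending=True)
titles_in_order = [case["Title"] for case in by_value]
```

Each title keeps its own Volume and Value after sorting, because the whole dictionary (the case) is moved rather than a single column.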
This example sorted on one criterion – Value. Multiple criteria sorting is possible by moving more
than one variable into the Sort by box (in the correct order). For example, to find the cheapest
hardback book in the Top 100 one would need to sort first on Binding (ascending order –
alphabetical) and then on ASP (ascending order – numerical). Some of the stages for this are
illustrated below.
The result, below, shows that The Tales of Beedle the Bard gave JK Rowling another first place,
at a remarkable ASP of only £4.03, with The Very Hungry Caterpillar not far behind.
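Sorting on more than one criterion can be sketched in the same way. In the Python illustration below the data are hypothetical (they are not the Top 100 figures) and the function name is an assumption; the variables are sorted in ascending order with the first listed taking priority.

```python
# Hypothetical cases, not taken from the Guide's data file.
books = [
    {"Title": "Title A", "Binding": "Paperback", "ASP": 5.10},
    {"Title": "Title B", "Binding": "Hardback", "ASP": 9.80},
    {"Title": "Title C", "Binding": "Hardback", "ASP": 6.25},
    {"Title": "Title D", "Binding": "Paperback", "ASP": 4.40},
]

def sort_by(cases, variables):
    """Sort ascending on several variables, the first listed taking priority."""
    return sorted(cases, key=lambda case: tuple(case[v] for v in variables))

ordered = sort_by(books, ["Binding", "ASP"])
cheapest_hardback = ordered[0]["Title"]
```

Because 'Hardback' precedes 'Paperback' alphabetically, the hardbacks come first and the cheapest of them heads the list. Reversing the order of the two variables would give a different result, which is why the order in the Sort by box matters.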
As an example, suppose we wished to select just those books among the top 20 best-selling
Paperback books with RRP under £10 first published in either of the two years 2003 and 2004.
This could be achieved as follows. First the criteria need to be put into terms SPSS will be able to
interpret. They are:
By a combination of moving variable names into the formula box using the blue arrow
and typing in numbers and arithmetic symbols (using the supplied on-screen keypad
or the keyboard) the following formula is inserted:
SPSS is very fussy about how the formula is constructed so it is wise to write it out on paper first
– and think about it!
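Writing the criteria out as a logical test can help before attempting the SPSS formula. The Python sketch below is one possible reading of the four criteria; the variable names (Position, Binding, RRP, Year) and the data are hypothetical, and 'top 20' is assumed to mean a position of 20 or better in the overall list.

```python
def selected(case):
    """Apply all four criteria; every one must hold for the case to be kept."""
    return (case["Position"] <= 20
            and case["Binding"] == "Paperback"
            and case["RRP"] < 10
            and case["Year"] in (2003, 2004))

# Hypothetical cases, each failing or passing for a different reason.
books = [
    {"Position": 3,  "Binding": "Paperback", "RRP": 7.99,  "Year": 2003},
    {"Position": 8,  "Binding": "Hardback",  "RRP": 7.99,  "Year": 2004},
    {"Position": 15, "Binding": "Paperback", "RRP": 12.99, "Year": 2004},
    {"Position": 25, "Binding": "Paperback", "RRP": 6.99,  "Year": 2003},
    {"Position": 19, "Binding": "Paperback", "RRP": 8.99,  "Year": 2004},
    {"Position": 11, "Binding": "Paperback", "RRP": 8.99,  "Year": 2005},
]

kept_positions = [b["Position"] for b in books if selected(b)]
```

Note that the two years are joined by 'or' (a book cannot be first published in both), while the remaining criteria are joined by 'and'. Mixing these up is a common source of wrong selections.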
For further guidance on using Select Cases and in particular constructing formulae see the
online Help (Help > Topics then Search for 'select cases').
The result of applying the formula is shown here. Six cases are selected (highlighted with arrows).
SPSS puts diagonal lines through the row numbers of excluded cases. There are just six without a line
through that have been selected as satisfying the criteria.
It is wise to check one has the right outcome – getting these formulae right can be a challenge.
► There are many other options in the Select Cases window – selection can be random, by
row number, by date, by variable values and ranges, etc.
Transform > Compute Variable... produces the Compute Variable window. Inside this
window, in the Target Variable box, must be typed a name for the new variable (e.g. Penetration).
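In outline, computing a variable means applying one formula to every case and storing the result under the target name. The Python sketch below illustrates this with hypothetical figures. The formula used for Penetration (users as a percentage of population) is an assumption made for the example, as are the variable names Users and Population.

```python
def compute_variable(cases, target, formula):
    """Add a new variable to every case, computed from that case's own values."""
    return [{**case, target: formula(case)} for case in cases]

# Hypothetical figures (in millions), not taken from the Guide's data sets.
countries = [
    {"Country": "A", "Users": 10.0, "Population": 40.0},
    {"Country": "B", "Users": 3.0,  "Population": 60.0},
]

with_penetration = compute_variable(
    countries, "Penetration",
    lambda c: 100 * c["Users"] / c["Population"],   # assumed formula
)
```

The original variables are left untouched; the new variable simply appears as an extra column for each case.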
One question of interest is “Is there any gender difference?” A frequency table (below) has too
many cells with small numbers in for a common statistical test called Chi-square to be valid.
In order to overcome this, the Recode command is needed to combine classes. The choice
made is to reduce the 13 classes down to 7 classes (with approximately equal frequencies for
each) as follows: 0 to 3 → 1, 4 → 2, 5 → 3, 6 → 4, 7 → 5, 8 → 6, 9 to 12 → 7.
The procedure is quite lengthy and intricate, but commonly needed, so worth mastering.
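Before working through the dialog boxes it may help to see the recoding as a simple rule: 0 to 3 become 1, each of 4 to 8 becomes a class of its own (2 to 6), and 9 to 12 become 7. The Python sketch below expresses that rule; the function name is hypothetical and this is not SPSS syntax.

```python
def recode(old):
    """Map an original code (0 to 12) to one of the seven combined classes."""
    if 0 <= old <= 3:
        return 1
    if 4 <= old <= 8:
        return old - 2      # 4 becomes 2, 5 becomes 3, ... 8 becomes 6
    if 9 <= old <= 12:
        return 7
    raise ValueError(f"unexpected code: {old}")

# Apply the rule to every possible original code, 0 to 12.
new_codes = [recode(v) for v in range(13)]
```

Listing the old and new codes side by side in this way is also a useful check against the report SPSS produces after recoding.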
Transform > Recode into Different Variables... opens the Recode into Different Variables window
(shown below). First the existing variable to be recoded must be passed from the list on the left
into the Input Variable -> Output Variable box on the right. Then a name for the new variable
must be typed in (mod_code) and, optionally, a suitable label for it entered (Module numbers
group). To confirm this Change is clicked.
Once Add is clicked, the second recoding command is moved into the Old --> New
box and the Old and New Value entries are cleared.
Note: There is no need to specify what the highest or lowest values are (they
might not be readily known in a very large data set). Instead one could use
the Range, LOWEST through value radio button for 0 to 3
and the Range, value through HIGHEST radio button for 9 to 12.
This will produce a report in the Viewer Output window as below. Normally these are of limited
interest but this one is worth taking notice of because it reports exactly what recoding has been
done. (It is very easy to not do what one intended or to forget what one did.)
There is one final matter to attend to which is to check that the newly created variable's attributes
are what they should be. SPSS provides defaults and it doesn't always get it right. It is necessary to
change the Data Editor window to Variable View. Below are SPSS's defaults for mod_code:
Finally, the new frequency table is shown below for comparison with that obtained originally.
The data is now in a form which satisfies the conditions of the statistical test.
Graphs and charts take many forms and can be obtained in many different ways in SPSS. The
two main procedures have huge choices of charts and ways to edit them. This can make it a
frustrating process to find out how to get exactly what you want.
Numeric data can be used for nominal (categorical) variables, ordinal variables or scale
(interval and ratio) variables.
Different kinds of chart are appropriate for different types of variable (nominal, ordinal,
scale).
If you try to generate a chart not suitable for the type of variable chosen then a warning
will be given. You can either choose another method or (sometimes) change the type
(Measure) of the variable.
There are two fundamental ways to obtain charts and graphs in SPSS:
Graphs > Legacy Dialogs: a method used in earlier SPSS versions (hence the
word 'legacy') which some users prefer.
The Graphs and Charts TUTORIALS later in this Guide will concentrate on Chart Builder....
In the next five sections we introduce for reference the Chart Editor menu bar and its four
associated toolbars for producing graphs and charts.
The Chart Editor Menus and Toolbars can only be described as complex and
comprehensive. There is a great deal of overlap between what is available (a) through the
Menu options, (b) selecting icons on the Toolbars and (c) by double-clicking on the regions
and elements of the Chart itself. For reference, below are details of the Toolbars. The Format
toolbar will be familiar to all Windows users.
The Edit, Options and Elements menus include all the items in the equivalent Toolbars,
plus a few more. The advantage of using the Menu Bar rather than Toolbars is that the
menus have descriptive words as well as icons. The Toolbar icons are, by default, always
visible and quicker to use. The icons may not be so easy to recognise, but their descriptions
appear when hovered over with the mouse cursor. Toolbars have an advantage over Menus,
as the correct menu has to be opened to find the procedure sought.
The positioning of the Toolbars will depend on the width selected for the Chart Editor
window.
The View menu has options to turn on and off the Toolbars.
The last item in the View menu (Large Buttons) enlarges all the
icons displayed in the selected Toolbars which can be a helpful
aid to recognition.
On the next page, for reference, are the four Toolbars with
annotations.
To check what each FORMAT icon represents you can hover over each with the mouse to
reveal a description of the action, as shown in the diagram below.
To check what each OPTIONS icon represents you can hover over each with the mouse to
reveal a description of the action, as shown in the diagram below.
Eleven prepared .sav data files are available for working with this Guide, which should appear as:
The contents of these data files are indicated below. Further details are given in the Appendix.
File     Contents                                                            Cases   Used in Tutorials
DATA01   100 best-selling books 1989 to 2010 (Nielsen) - UK                  100     2, 3, 19, 20, 27, 28, 30
DATA02   Internet usage by age / gender - Europe                             10      5
DATA03   University students' responses to a VLE questionnaire - UK          150     8-10, 12-18, 21, 24, 26-30
DATA04   Open Access policies in HEIs - UK                                   39      4, 6, 7, 8
DATA05   Open Access practices of researchers - UK                           418     15
DATA06   School students' attitudes to mathematics - UK                      180     29, 31-34
DATA07   IT Piracy rates, Population, GDP and GNI - Worldwide                109     22, 23
DATA08   Facebook and Internet Users - Worldwide                             157     None
DATA09   Internet Users - Europe                                             31      None
DATA10   Literacy rate, Population, Land area, GDP and GNI - Worldwide       155     None
DATA11   Internet Users by World Region - Worldwide                          7       Reference Section 9.1
PART 2 - TUTORIALS
TUTORIAL T1: Starting the SPSS program
For instructions on how to load SPSS and locate data files on the computers you are using see
separate documentation or consult the appropriate Appendix (if provided).
2. Click on the start icon located in the bottom left corner of the screen
3. Locate the list of programs (select All Programs if necessary) and find
the SPSS program (e.g. IBM SPSS Statistics 19).
► If the dialog box appears, it will present various button choices. In practice, users often find
it easier not to use this dialog box and instead use the drop-down menu options from the
main Data Editor window. (If that is preferred, then click in the 'Don't show this dialog in
the future' box at the bottom of the window before closing the window.)
► The Data Editor window will appear, and all its menus and tools will be displayed.
► At this point you have a choice: either enter data yourself (Reference Section 4 and
Tutorial 4) or load a prepared data file (Reference Section 5, and all other Tutorials).
6. Click on DATA01_100Books.sav.
► The selected file will now be loaded into a Data Editor window, which may be in Data
View mode (illustrated below) or Variable View mode.
8. If necessary, change to Data View (by clicking the bottom left button). Scroll round the data
set to view all the variables (columns) and to check that there are 100 cases (rows).
► You can enlarge the window to reveal more information, and scroll also.
Sort Alphabetically.
2. Select Hardback or Paperback [Binding] and click on the right-pointing blue arrow to move
it into the Variable(s) box for processing. Repeat for Year of Publication [Year].
3. Click OK to produce Output displayed in the Viewer window, part of which is shown
below.
► Firstly, this shows that there were two variables analysed with 100 valid cases for each,
and no data missing for either.
► Secondly, this shows that there were 28 Hardbacks and 72 Paperbacks in the Top 100
list. This variable – Binding – takes two values which are simply names. It is a nominal
variable producing nominal data.
► Thirdly, this shows the Year of Publication numbers of Top 100 books (e.g. 3 in 1994).
This variable – Year – takes many discrete values which have an order to them. It is an
ordinal variable producing ordinal data.
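For readers who like to see the arithmetic, the frequency table described above can be sketched in a few lines of Python. This is an illustration only, not part of SPSS: the list below is reconstructed from the counts reported in the output (28 Hardbacks and 72 Paperbacks), and the variable names are assumptions.

```python
from collections import Counter

# Rebuild the Binding variable from the counts reported in the output
binding = ["Hardback"] * 28 + ["Paperback"] * 72

freq = Counter(binding)            # frequency of each distinct value
total = sum(freq.values())         # number of valid cases

for value, count in sorted(freq.items()):
    percent = 100 * count / total
    print(f"{value:<10}{count:>5}{percent:>8.1f}%")
print(f"{'Total':<10}{total:>5}")
```

Counting works the same way for nominal and ordinal data; the difference lies only in whether the order of the categories carries meaning.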
4. We now return to the Frequencies procedure and explore the Statistics options.
Select Analyze > Descriptive Statistics > Frequencies again.
5. The same two variables should still be in the Variable(s) box on the right. Click Reset to
remove them.
► If necessary, you can alter the format of the current displayed list by right-clicking on the list
and choosing the display format you want: Display Variable Labels and Sort Alphabetically.
8. Click Continue.
► Note the warning below the Statistics box stating that there are other modes for Number sold.
That is because each book sold a different number of copies so every 'sold' number has
frequency 1 and therefore they are all technically modes (the most common). So here 'mode'
is a useless statistic! Mode is only of interest when there are just one or two of them.
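The point about multiple modes can be illustrated with a short Python sketch. The sales figures and months below are made up for illustration; only the behaviour (every value is a mode when all values are distinct) reflects the note above.

```python
from statistics import multimode

# Hypothetical sales figures: every value occurs exactly once
copies_sold = [5200, 4100, 3900, 3700]
sold_modes = multimode(copies_sold)
print(sold_modes)          # every value is returned as a mode

# Hypothetical months of publication: one value clearly occurs most often
months = ["Sep", "Oct", "Oct", "May", "Oct"]
month_modes = multimode(months)
print(month_modes)
```

When the list of modes is as long as the data itself, the mode tells you nothing; when it contains one or two values, it is worth reporting.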
11. We return to the Frequencies procedure and explore its Chart options.
Click Reset to remove any selected variables and select Month of Publication [Month].
12. Open the Charts option and click Bar charts. Leave Chart Values as Frequencies.
14. Click OK to obtain the chart in the Output window (scroll down to see it, if necessary).
The above has provided a short introduction to one of the simpler SPSS procedures. Even so,
there has been a lot to learn. And SPSS has many other procedures: Analyze is just one of
four main procedural menus; the Analyze menu has 22 menu options (of which about seven are
likely to be of use to most Information Scientists); of these seven, the Descriptive Statistics
menu option itself has seven menu options (of which three are mostly used).
Fortunately, learning to use Frequencies provides a good basis for using many other procedures.
It is a good idea to close the Output window so that a new one is created for the next Tutorial.
This Tutorial can be skipped if you are not interested in creating your own data files.
New data files are created by keying data into a new Data Editor window whose title will begin
Untitled...
Below is a small amount of real data selected from the supplied data file DATA04 containing the
results of a survey of UK HEIs about their Open Access Policies. The full dataset has 39 cases
with 26 variables and will be used later in this Guide.
The data here, which has 5 cases and 5 variables, is for you to practise entering and formatting
data. The five cases are:
There are two systematic ways you can enter the above data in Data View:
It is a matter of preference. Provided you have already entered the variables information first, the
advantage of going across is that you can use Tab to move from cell to cell and SPSS will
automatically take you to the first cell of the next case when you have finished entering all the
data for one case.
There are two systematic ways you can enter the above variable information in Variable View:
(1) You can go across entering each variable name and its attributes and then move onto
the next variable.
(2) You can go down entering all the variable names and then go back and edit all the
attributes.
It is a matter of preference, which may depend on how many cases there are, how many
variables there are, and how much variable information has to be added.
In this TUTORIAL we use case by case entry for data and variable by variable entry for variable
information.
It is possible to enter the data before entering all the variable information (SPSS will use dummy
variable names VAR0001 etc. which you can later edit) but it is preferable to enter the variable
information first – especially if there are any Values labels.
then when you click (or double-click) in the cell a drop-down menu will be
made available (by clicking on the down arrow)
and you can select the entry you want rather than needing to type it in
1. If the Data Editor window is already visible and is empty (as it will be if just starting SPSS)
then you can proceed straight to step 5. Otherwise, first create a file as follows:
► A Data Editor window with title beginning Untitled will appear (see below).
5. Make sure you are in Variable View by clicking the button at the bottom left corner.
6. In row 1 type in the first variable name INSTITUTION into the first cell of the column headed
Name and press Enter or Tab on your keyboard.
► SPSS has filled in defaults for many of the variable's attributes. Most are OK but some
of these need changing or more information inserted.
8. Select 'String'.
9. Click OK.
► SPSS automatically sets INSTITUTION's Decimals cell entry to '0' as it is a string variable:
► All the other SPSS default entries for INSTITUTION are OK.
► The information for the remaining variables must now be entered in a similar way.
15. Insert Values as in this table:
1 = RLUK member
2 = Other Pre-1992 university
3 = Post-1992 university
4 = HE college
Do this as follows.
16. Insert '1' for the Value and 'RLUK member' for the
Label and click Add.
17. Insert '2' for the Value and 'Other Pre-1992 university'
for the Label and click Add.
18. Repeat the process for the other two labels and click
OK.
19. In row 3 type in the Name 'Q1' and set the Decimals to '0' and type in the Label 'Written OA
policy', set Align to 'Left' and set the Measure to 'Nominal'.
21. Insert '1' for the Value and 'Yes' for the Label.
23. Insert '2' for the Value and 'No - planned' for the Label.
24. Click Add, repeat the process for the other three labels, then click OK.
► This inserts all the value labels into the data file.
► The Values entry for Q1 in the Data Editor will now show the first entry (all
the value labels can be seen by clicking on the three dots).
26. Change Type from 'Numeric' to 'Comma', and set Decimals to '0'.
► 'Comma' is the same as 'Numeric' but inserts commas to mark off every three digits
before the decimal point, to aid legibility.
► Setting the number of decimal places can be done within the Variable Type dialog box at
the time 'Comma' is selected, or done separately.
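The effect of a 'Comma' display format can be mimicked in Python to show that only the presentation changes, not the stored number. This is a sketch with made-up amounts; the variable names are assumptions, and the '£' prefix simply imitates a currency-style display.

```python
# Hypothetical amounts stored as plain numbers
amounts = [1234567, 950, 48000]

# Comma style: a comma marks off every three digits, no decimal places
comma_style = [f"{a:,}" for a in amounts]

# Currency style: the same digits with a '£' prefix added for display only
currency_style = [f"£{a:,}" for a in amounts]

print(comma_style)
print(currency_style)
```

In both cases the underlying values are unchanged, which is why such data are typed in without commas or currency signs.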
To do this use Edit > Options > Currency and select CCA and enter '£' in the prefix
box (if not already there).
► This establishes the Custom Currency type CCA as £ which can be selected as required.
► This presents an empty array of cells with the five variable headings at the top.
► Before proceeding to entering the data, a summary of ways to move around the Data
Editor window is presented, for easy reference.
• Pressing Tab moves the cursor from one data cell to the next cell to the right until the last
variable is reached, at which point the next Tab press moves the cursor to the first cell of
the next row. This is particularly useful when in Data View for entering data case by
case.
• Vertical and horizontal scroll bars are located at the right hand side and bottom and can
be used to move around the window.
• Arrow keys can be used to move from cell to cell, instead of 'point and click'.
• Control + Home key combination takes you to the first cell (top left) in the window.
• Control + End key combination takes you to the last cell (bottom right) in the window.
To move around the Data Editor window in Data View mode only:
• Control + Up Arrow key combination takes you to the first cell in the current column.
• Control + Down Arrow key combination takes you to the last cell in the current column.
• Control + Left Arrow key combination takes you to the first cell in the current row.
• Control + Right Arrow key combination takes you to the last cell in the current row.
34. Enter the data shown in this table, but first read the notes below.
► Do not enter the £ signs (they are provided by the Custom Currency Type attribute).
► The best way to enter the data is to type in the first entry then press Tab to move across to
the next cell. When at the end of the row Tab automatically takes you onto the first cell in the
next line.
► Note that SPSS has inserted a dot where a value was missing. The dot is the SPSS
missing values symbol.
► The alignment for the variables INSTITUTION and SECTOR could be better centred.
SPSS’s default is Left alignment for String and Right alignment for everything else.
35. Save the data using File > Save As and type in a name of your choice to a destination of
your choice. (See Reference Section 6 for more details.)
When data has been entered it needs to be checked. For a large data set this is an important and
considerable task. SPSS has a special procedure (Case Summaries) designed for this which will
be introduced in TUTORIAL T5. For a small data set various other SPSS procedures can be just
as – or even more – effective in looking for mistakes, such as wrong values, missing values,
typing errors in labels and names.
Entering this small data set you probably have made no errors. As a simple check we explore the
data using the Frequencies procedures (met in TUTORIAL T3).
We try to find the frequency tables, mean values and bar charts for all five variables all in one go,
proceeding as follows (you can skip step 1 if continuing straight on from T4.3).
5. Open the Statistics options window and select Mean, Median, Mode, Minimum, Maximum.
6. Click Continue.
► Note that when there is missing data it is excluded from calculations (for the mean etc.).
► Note the difference when the value is not missing but is zero – it is included in calculations.
► Note that the mean is calculated for Q1 [Written OA policy] – although it is meaningless,
because Q1 is defined as a Numeric variable.
► Note that the mean is not calculated for INSTITUTION [Identifier code] – because it is
defined as a String variable.
► Note that the variable labels and not the variable names are shown in this table. This is a
matter of choice, controlled by Edit > Options – see Section 2.5.4 on page 9.
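The difference between a missing value and a zero, noted above, can be made concrete with a small Python sketch. The figures are hypothetical and `None` stands in for the SPSS missing-value dot; this illustrates the arithmetic only.

```python
from statistics import mean

# Hypothetical values for five cases: one missing (None), one genuine zero
running_costs = [12000, None, 0, 8000, 4000]

# Missing values are excluded before calculating, zeros are kept
valid = [v for v in running_costs if v is not None]
mean_with_zero = mean(valid)

# For contrast: what the mean would be if the zero were wrongly treated as missing
mean_without_zero = mean(v for v in valid if v != 0)

print(len(valid), mean_with_zero, mean_without_zero)
```

The two means differ noticeably, which is why it matters whether an empty answer is entered as missing or as zero.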
It would be a long and tedious and rather pointless task to enter all the data for
DATA04_OpenAccess_HEIs with its 39 cases each with 26 variables! Fortunately you are
spared this as the dataset is provided for use with this Guide. Here you are encouraged to repeat
the same analysis as was done in T4.4, this time on the full dataset.
3. Click Apply.
4. Click OK.
11. Open the Statistics options window and select Mean, Median, Mode, Minimum and
Maximum.
This is very much an optional TUTORIAL which can be returned to at a later time.
It is important to check that data is entered correctly. Any of the various procedures which
produce tables can be used. For example:
In this TUTORIAL you will load a specially constructed small data file and generate a list of
values for variables to look for anomalies.
2. Select Analyze > Reports > Case Summaries to open the Summarize Cases window.
9. Click Continue.
10. Click OK to produce two tables in the Viewer Output window, shown below:
► This shows that for some reason the fourth variable is excluded from two cases.
► The table below (shown partially) lists the values of all seven selected variables for all 10
cases. The bottom of the table provides the statistics asked for (Number, Mean, Min, Max).
► The Number, Mean, Min, Max statistics – together with applied common sense – are
very useful for spotting anomalies and errors, but will not find them all, depending on
their nature.
► The reader should do some detective work looking at the full output to see what errors
there appear to be, in conjunction with viewing the actual dataset.
► Finally look at the notes in the table below to see if you spotted all the anomalies.
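The kind of check that the Number, Mean, Min and Max statistics make possible can be sketched in Python as follows. The data and the function name are hypothetical; the sketch only shows how these four statistics can reveal an implausible value and a shortfall in the number of valid cases.

```python
from statistics import mean

def summarise(values):
    """Return N, Mean, Min and Max for the non-missing values."""
    valid = [v for v in values if v is not None]
    return {"N": len(valid), "Mean": mean(valid),
            "Min": min(valid), "Max": max(valid)}

# Hypothetical ages for 10 cases: two missing, one obvious typing error (220)
ages = [21, 19, 23, 220, None, 20, 22, 18, None, 24]
summary = summarise(ages)
print(summary)

# A simple common-sense check on the plausible range
suspicious = [a for a in ages if a is not None and not 16 <= a <= 100]
print(suspicious)
```

A maximum of 220 stands out at once, whereas an error such as 21 typed as 12 might pass unnoticed, which is why these statistics will not find every mistake.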
1. Load data file: File > Open > Data > DATA04_OpenAccess_HEIs.sav (if not loaded).
► Use Edit > Options to check that in the General tab window the Variable Lists
choices are Display names and File, to match the listing format used in this tutorial.
Change if needed, then click Apply, then OK to close the message box which appears,
and OK again to exit.
4. Click OK.
► This produces a single number – the mean of all the values for that variable. It is
displayed with no decimal places (the Auto default setting in this case):
► You can close the annoying Custom Tables pop-up dialog (if you first click on the
'Don't show this dialog again' box it won't appear again).
► This produces the mean of all the values for that variable, but now it is displayed with 2
decimal places:
T6.2 One-variable Frequency Table for nominal data – Count and Total
1. Load data file: File > Open > Data > DATA04_OpenAccess_HEIs.sav (if not loaded).
3. Drag the nominal variable Q1 [Written OA policy] to the Rows label rectangle (it will
change to red until the mouse button is released).
7. Click Apply.
8. Click OK to produce the table in the Viewer Output window as shown below.
► Note that by default this procedure produces a Count of the different values for the
variable and we have additionally asked for the Total which appears at the bottom.
T6.3 One-variable Frequency Table for ordinal data – Count with Subtotals
1. Load data file: File > Open > Data > DATA04_OpenAccess_HEIs.sav (if not loaded).
3. Click Reset.
► This variable has four codes (1 to 4) for a range of responses to a question plus two
codes for Don't know (5) and N/A (6). It could be useful to have separate subtotals for
the first four and the other two. This is how it can be achieved.
6. In the Define box click on Categories and Totals… which will show the six responses:
9. Click Continue.
12. Click on Add Subtotal… in the Subtotals and Computed Categories box.
13. In the Define Subtotal window change the Label to Total Non-Responses.
► This will insert Total Non-Responses below the value label ('Don't know').
► You may need to enlarge the whole Categories and Totals window (or scroll down in
the Value(s) box) to see this.
15. Ensuring that Total Non-Responses is still highlighted, click the blue down arrow once to
move Total Non-Responses to beneath the last value label ('N/A').
► Note that the Value(s) column shows the range for Total Non-Responses to be 5…6.
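The idea of separate subtotals for codes 1 to 4 and for codes 5 to 6 can be sketched in Python. The response data and variable names below are hypothetical; only the grouping of the codes follows the description above.

```python
from collections import Counter

# Hypothetical coded responses: 1-4 are substantive answers,
# 5 = Don't know, 6 = N/A
responses = [1, 2, 2, 3, 4, 4, 4, 5, 6, 6]
counts = Counter(responses)

# Subtotal over the four substantive codes
total_responses = sum(counts[code] for code in range(1, 5))

# Subtotal over the two non-response codes (the range 5...6)
total_non_responses = sum(counts[code] for code in (5, 6))

for code in range(1, 7):
    print(code, counts[code])
print("Total Responses", total_responses)
print("Total Non-Responses", total_non_responses)
```

The two subtotals always add up to the overall count, which gives a quick check that no code has been left out of either group.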
T6.4 One-variable Frequency Table for nominal data – Count with order sorted
1. Load data file: File > Open > Data > DATA04_OpenAccess_HEIs.sav (if not loaded).
3. Click Reset.
10. Select Descending in the Sort Categories drop down Order menu.
16. In the Sort Categories Order drop down menu select Ascending.
1. Load data file: File > Open > Data > DATA04_OpenAccess_HEIs.sav (if not loaded).
3. Click Reset.
10. Click on Count in the Statistics box (it will be the top entry).
11. Click on the right pointing arrow to insert Count into the Display box.
13. Click on the right pointing arrow to insert Maximum into the Display box.
15. Click on the right pointing arrow to insert Minimum into the Display box.
17. Click on the right pointing arrow to insert Median into the Display box.
► This produces a set of four statistics for the four values for the HEI group:
► The order in which the variables appear is determined by their order in the Display box,
which can be changed using the up/down arrows. (Repeat Analyze > Tables >
Custom Tables and click on N% Summary Statistics if interested.)
► 'RLUK' stands for 'Research Libraries UK' - this variable indicates whether a library is
a member of this organisation.
► In 1992 many former polytechnics gained university status. Subsequently some other
HE colleges have gained university status. 'Post-1992' includes the former
polytechnics and those former HE colleges.
► Some HE colleges provide degree level courses, particularly specialist colleges (e.g.
Art, Music).
T7.2 Two-variable Nested Frequency Table for nominal data – Count & Col%
1. Load data file: File > Open > Data > DATA04_OpenAccess_HEIs.sav (if not loaded).
3. Click Reset.
5. Drag nominal variable Q3 [OA material in Library Catalogue] to the Rows label.
7. Click on the column heading 'OA material in …' in the display icon to select that variable.
10. Click on the right pointing blue arrow to move Col N % into the Display box.
T7.3 Two-variable Two-way Frequency Table for nominal data – Count & Col%
1. Load data file: File > Open > Data > DATA04_OpenAccess_HEIs.sav (if not loaded).
3. Click Reset.
5. Drag nominal variable Q3 [OA material in Library Catalogue] to the Columns label.
[This is the difference from T7.2]
10. Click on the right pointing blue arrow to insert Row N% into the Display box.
► Although this table contains exactly the same data as that in T7.2 you may find this
one much easier to understand.
► There is a lot to choosing the right format for tables for the given purpose.
► There are many types of percentages which can be applied to tables. If you have the
time and inclination it can be an interesting challenge to explore them further. Below is
the complete list found in the Statistics box!
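Two of the most commonly used percentages, Row N% and Column N%, can be sketched in Python to show how they differ. The counts and category names below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical two-way table of counts: rows are HEI groups, columns are answers
table = {
    "RLUK member": {"Yes": 6, "No": 4},
    "HE college":  {"Yes": 2, "No": 8},
}
answers = ["Yes", "No"]

# Row N%: each count as a percentage of its row total
row_pct = {
    row: {a: 100 * cells[a] / sum(cells.values()) for a in answers}
    for row, cells in table.items()
}

# Column N%: each count as a percentage of its column total
col_totals = {a: sum(cells[a] for cells in table.values()) for a in answers}
col_pct = {
    row: {a: 100 * cells[a] / col_totals[a] for a in answers}
    for row, cells in table.items()
}

print(row_pct["RLUK member"]["Yes"])   # share of RLUK members answering Yes
print(col_pct["RLUK member"]["Yes"])   # share of Yes answers given by RLUK members
```

The same cell therefore answers two different questions depending on which percentage is chosen, so the choice should follow the comparison you want the reader to make.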
T7.4 Two-variable Two-way Frequency Table for nominal and ordinal data –
interchanging rows and columns – Row%
1. Load data file: File > Open > Data > DATA04_OpenAccess_HEIs.sav (if not loaded).
3. Click Reset.
5. Drag nominal variable RLUK [Research Libraries UK member] to the Rows label.
6. Drag ordinal variable Q2a [Self-archiving in HEI's repository] to the Columns label to
produce this table template:
12. Click on the right pointing blue arrow to insert Row N% into the Display box.
14. Click OK to produce the table below in the Viewer Output window:
16. Right click on the output table template to produce this drop down menu:
17. Click Swap Row and Column variables to produce the reorganized table template:
19. Click OK to produce the table below in the Viewer Output window:
► This is probably not what you would expect! The rows and columns have indeed been
swapped (compare to the table below step 14). However, the Row N% has been
replaced by the default Count. Presumably this is because Row N% might not be what
was now wanted (maybe Column N%?) and SPSS cannot decide.
► You can, of course, go back and choose whatever you require, by selecting
Analyze > Tables > Custom Tables and clicking on N% Summary Statistics.
T7.5 Two-variable Nested & Stacked Frequency Tables for nominal and scale
data – Count
1. Load data file: File > Open > Data > DATA04_OpenAccess_HEIs.sav (if not loaded).
3. Click Reset.
6. Drag scale variable HE_TYPE [Type of HEI] into the table template box and drag it around
over RLUK's box and note that it can produce the following:
7. Investigate in turn what these five actions produce. Below are the results you should obtain.
► Which produces a nested table with HE_TYPE on the left and RLUK on the right.
T8.1 Three-variable Frequency Table for two nominal variables and one scale
variable – Mean
1. Load data file: File > Open > Data > DATA04_OpenAccess_HEIs.sav (if not loaded).
3. Click Reset.
7. Drag scale variable Q9e [Running costs - Total (£)] to the Columns label.
► Reminder: 'RLUK' stands for 'Research Libraries UK' - this variable indicates whether
a library is a member of this organisation.
► There are empty cells because there were no entries in some categories. However,
the entry of zero for the Post-1992 universities is different – it shows that there was at
least one entry but the mean score was £0.
► The RLUK members with written OA policies spent much more on average than RLUK
members without OA written policies.
► The RLUK members with OA written policies spent more on average than other Pre-
1992 universities and Post-1992 universities with OA written policies.
► It is seen that in this sample the Post-1992 universities spent a lot more than the
specialist HE colleges.
► The order in which the three variables are organised, and whether the table is nested
or stacked or a mixture is controlled by the order in which the variables are dragged to
the Rows label or Columns label or dragged within the table template display box.
There is plenty of room for exploration here!
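The distinction drawn above between an empty cell and a cell showing a mean of zero can be sketched in Python. All figures and category names below are hypothetical; the sketch only shows why a cell with no cases has no mean, while a cell whose cases are all zero has a mean of 0.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (HEI group, written OA policy, running costs in £)
records = [
    ("RLUK member", "Yes", 50000),
    ("RLUK member", "Yes", 30000),
    ("RLUK member", "No", 10000),
    ("Post-1992 university", "No", 0),   # a real entry whose value is zero
]

# Collect the values falling into each (group, policy) cell
cells = defaultdict(list)
for group, policy, cost in records:
    cells[(group, policy)].append(cost)

cell_means = {cell: mean(values) for cell, values in cells.items()}

print(cell_means[("RLUK member", "Yes")])
print(cell_means[("Post-1992 university", "No")])   # zero mean: at least one entry
print(cell_means.get(("HE college", "Yes")))        # None: no entries, so an empty cell
```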
► Use Edit > Options to check that in the General tab window the Variable Lists
options selected are Display names and File, to match the variable list format used in
this tutorial.
► Use Edit > Options to check that in the Output Labels tab window the Pivot Table
Labeling option selected for Variables in labels shown as is Labels, and the option
selected for Variable values in labels shown as is Labels, to match the output display
formats used in this tutorial.
► Be sure to click Apply if you need to make any changes above, before clicking OK to exit.
4. Drag nominal variable ft_pt [Full Time or Part time] to the Columns label.
► Varying the order of dragging in the variables (or positioning them in the rows and
columns or within the table template) will produce different presentations of the data
(for the interested reader to investigate.)
► Using Edit > Options to change the Output Labels tab window Pivot Table
Labeling options for Variables in labels ... and for Variable values in labels ... will
produce different output display formats (for the interested reader to investigate).
T8.3 Four-variable Frequency Table for three nominal variables and one scale
variable – Median and Mode
► Use Edit > Options to ensure that in the Output Labels tab window the Pivot Table
Labeling option for Variables in labels shown as is Labels, and the option for Variable
values in labels shown as is Labels, to match the output display formats used here.
4. If variables are already selected click Reset and click All Tabs.
5. Locate and drag scale variable modules [Number of modules] to the Columns label.
6. Click on N% Summary Statistics in the Define box and remove Mean from the Display
box (by selecting Mean and using the blue arrow) and replace it with Median and Mode
(along the lines of what was done in TUTORIAL T7.1).
10. Drag nominal variable ft_pt [Full time or Part time] to the Rows label.
1. Load data file: File > Open > Data > DATA03_LSquestionnaire.sav (if not loaded).
4. Click on Options.
5. Select Skewness
(this is a measure of how 'flattened' to one side the distribution
of the data is. N.B. positively skewed means having a longer tail
to the right).
► The N column indicates that there is one missing value for Module C, i.e. 149 valid entries.
► Module D marks are very different from the others – with a much lower mean, a much
larger standard deviation and a much wider range (Max – Min).
► The standard deviations are similar for Modules A, B and C, so they could be combined,
and a parametric test such as a t Test (or ANOVA) could be used to compare them.
► Significant skewness occurs when the Skewness Statistic lies outside +/- 2 x Std. Error.
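The rule of thumb above (skewness outside +/- 2 x Std. Error) can be sketched in Python. This assumes the adjusted sample skewness and the standard error formula commonly documented for statistical packages; the function name and the marks are hypothetical, and the results are not claimed to match any particular SPSS output.

```python
from math import sqrt

def skewness_and_se(data):
    """Return (adjusted sample skewness, standard error of skewness)."""
    n = len(data)
    m = sum(data) / n
    s = sqrt(sum((x - m) ** 2 for x in data) / (n - 1))   # sample std. deviation
    g1 = n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in data)
    se = sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    return g1, se

# Hypothetical marks with one very high value (a long tail to the right)
marks = [1, 1, 2, 2, 3, 10]
skew, se = skewness_and_se(marks)
significant = abs(skew) > 2 * se
print(skew, se, significant)

# A perfectly symmetric sample has skewness (essentially) zero
sym_skew, sym_se = skewness_and_se([1, 2, 3, 4, 5])
print(sym_skew, sym_se)
```

Note that the standard error depends only on the number of cases, so with small samples quite large skewness values can still fall inside the +/- 2 x Std. Error band.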
3. Use the blue arrow to move the variable modules into the Dependent List box.
► This variable records the number of modules (out of a maximum of 12) about which
each student accessed information on the institution's VLE.
5. Click Continue.
► The table contains many basic statistics about the chosen variable:
► The stem-and-leaf plot (below left) is a cross between a table and a chart.
► The box plot (below right) illustrates the spread of the data – the box itself contains the
middle 50%.
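What the box in a box plot represents can be sketched in Python by computing the quartiles directly. The module counts below are hypothetical, and the quartile method used by Python's statistics module may differ slightly from the one SPSS uses for its box plots, so treat this as an illustration of the idea only.

```python
from statistics import quantiles

# Hypothetical numbers of modules accessed by 11 students (maximum 12)
modules = [0, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12]

# Cut points dividing the sorted data into four equal groups
q1, median, q3 = quantiles(modules, n=4)
iqr = q3 - q1          # the length of the box

print(q1, median, q3, iqr)
```

The box runs from the lower quartile to the upper quartile, with the median marked inside it, so its length shows how spread out the central half of the data is.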
1. Load data file: File > Open > Data > DATA03_LSquestionnaire.sav (if not loaded).
► Gallery should be highlighted (if not, then select it).
6. Close (or just ignore) the Element Properties window which appears.
► Element Properties… is available using a button in the Chart Builder window located
on the far right below the Chart Preview box.
► At this point Chart Preview will show a generic bar chart (as illustrated below).
7. Locate usage_level in the Variables list and drag it across to the X axis box whose label
'X-Axis?' has a questionable question mark.
3. Click on Bar and drag the 'Simple Bar' icon from the Gallery into the Chart Preview box.
10. Click Reset (otherwise SPSS will still think the variable is scale).
11. Click on Bar and drag the 'Simple Bar' icon from the Gallery into the Chart Preview box.
12. Locate modules in the Variables list and drag it across to X-Axis?.
► Use Edit > Options to check that in the General tab window the Variable Lists
choices are Display names and File, to match the list format used in this tutorial.
2. Select Graphs > Chart Builder... and click on Reset if a chart is already selected.
4. Drag the 'Simple Bar' icon from the Gallery into the Chart Preview box.
► Note that the Y axis label 'Count' is quite small (we will enlarge it a bit later).
Every element and aspect of this chart can be edited and further elements can be added:
X-axis and Y-axis labels and their fonts, font sizes, font styles, colours.
Y-axis increments, tick positions, font sizes.
Bar colour.
Bar label wording, font, font size, font style, colour.
Chart background colour.
Orientation.
To edit a chart double-click on it. This opens the Chart Editor window. Once in Chart Editor
there are several ways to set about editing an element of a chart. The two most used are:
► This opens the Chart Editor window and may open a Properties menu (if so, close it).
13. To edit the X-axis label ('Programme') in the same way: click on the label to select it.
► The relevant Properties window should still be open (if not, use Edit > Properties).
15. To change the font size of the label, click on the Size drop-down menu and choose '16'.
16. To check the style of the label, click on Style drop-down menu (default 'Bold').
17. To change the colour of the label, click on a colour in the Color palette (e.g. bright red).
► The word 'Title' should appear centrally above the chart (drag away the Properties
window to see it, if necessary).
20. Select the word 'Title' by dragging across it and change it to 'Students on Programmes'.
21. Double-click on the title, if necessary, to open the Properties window shown below.
► The inserted line should still be highlighted (yellow rectangle round it), if not click on it.
25. To add a small text box to explain what the reference line represents use
Options > Text Box.
26. Drag the yellow box down lower if it is obscured by the title. Select the word 'Textbox' and
change it to 'MEAN'.
28. Drag it into position on the reference line and click elsewhere to embed it.
29. To add markers to the individual bars to show their values, click on a bar (so that all the
bars are highlighted with pale yellow rectangles) and select Elements > Show Data Labels.
34. Finally, click on the Chart Editor window‟s close box to shut the Chart Editor window
and embed the chart into the Viewer Output window. The next page shows how it should
look.
1. Load data file: File > Open > Data > DATA03_LSquestionnaire.sav (if not loaded).
2. Select Graphs > Chart Builder... and click on Reset if any chart is already selected.
4. Drag the „Clustered Bar‟ (second icon across) from the Gallery into the Chart Preview box.
7. Locate gender in the Variables list and drag it across to Cluster on X: set color (top
right corner).
8. Click OK to generate a Clustered Bar Chart.
► This may also open the Properties window. If not, select Edit > Properties.
► Note that the labels for the programmes are a bit jumbled because they are so long.
► This could be overcome by transposing the chart by clicking the Toolbar icon
but this will not be done just now (you can try it, if so click again to undo it).
11. Select the Bar Options view and change the width
of Bars to „80‟ and the width of Clusters to „75‟.
Either use the sliders or type in the numbers.
12. Click Apply if using the sliders, or if typing in the numbers you can press Enter instead.
► This will slim down the bars and slightly separate the bars for the males and females.
13. Click on the blank area inside the chart rectangle to select the whole „inner frame‟.
► A pale yellow rectangle will surround the „inner frame‟ containing the bars, the axis
labels and the legend.
15. Ensure Fill & Border view is selected and click on the small Color Fill box and then click
on „yellow‟ for the background.
► This heightens the bars a little and puts more ticks on the Y-axis, to aid reading off values.
[WARNING CONCERNING STEPS 21 to 23 BELOW: INSERTING MINOR TICKS DOES NOT SEEM TO WORK IN
SPSS 19.0 ALTHOUGH IT DID IN SPSS 18.0. IT IS INCLUDED HERE – TRY IT – THE BUG MAY BE FIXED SOON!]
24. Click on the Gender Legend„s Male small coloured box to select it (a pale yellow
rectangle will appear round it).
► This highlights the Y axis again (a yellow rectangle will appear round it).
32. As the programme names are long the labels may be squashed up and if so it can be
better to transpose the axes. This is very easily achieved:
33. There is an alternative way to improve the label legibility, which may be preferred.
34. Select Edit > Select Chart and close the Properties window (if it is still open).
► A pale yellow rectangle should enclose the chart, with corner and side „handles‟ which
can be dragged to resize the chart.
► Alternatively, close the Chart Editor to embed the chart in the Output window and
click on the chart and resize it there.
35. Either drag a handle to the right to stretch the chart and separate the labels or drag to the
left to narrow the chart and force the labelling to go sideways. The two effects are shown
below:
Using the data from T14.1, the question, whose responses are to be illustrated here, asked
students if they would be happy to miss lectures if lecture notes were available on the VLE.
3. Drag the „Clustered Bar‟ icon from the Gallery into the Chart Preview box.
4. Locate miss_lectures in the Variables list and drag it across to the Y-Axis.
6. Click Apply.
7. Click Close.
9. Locate ug_pg in the Variables list and drag it across to Cluster on X: set color.
If you are continuing straight on from T14 you can skip straight to Step 2.
1. Load data file: File > Open > Data > DATA03_LSquestionnaire.sav (if not loaded).
2. Select Graphs > Chart Builder... and then click on Reset if continuing from T14.
4. Drag the „Stacked Bar‟ (third icon across) from the Gallery into the Chart Preview box.
7. Locate gender in the Variables list and drag it across to Stack: set color.
The changes made to the Simple Bar Chart (T12, T13) and Clustered Bar Chart (T14) apply also
to the Stacked Bar Chart, which we will not illustrate here. We will only make one (new) change.
► To undo this, and revert to the normal Stacked Bar Chart, either select
Options > Scale by value or click the Options Toolbar icon again. (Try this and then
revert to the Percentage Stacked Bar Chart before continuing.)
10. Before finishing, we will remove the unnecessary decimal places which will appear (by
default) on the Y-axis. Do this by selecting Edit > Select Y-Axis, which will open this
Properties window:
The LISU/SQWconsulting Open Access Report for RCUK (see Appendix) includes this double chart:
Here we will explore how to produce a chart to closely resemble one of these, using the
supplied subset of the Open Access data. (As it is a subset of the data, the chart will not be
exactly the same, but, provided the subset is representative, it should be similar.)
► Use Edit > Options to check that in the General window the Variable Lists choices
are Display names and File, to match the list format used in this tutorial.
4. Drag the „Stacked Bar‟ icon from the Gallery into the Chart Preview box.
7. Locate Q03 in the Variables list and drag it across to Stack: set color.
► The Y-axis will show „Count‟ as the variable (i.e. the frequency).
9. Click OK.
21. Click on the Legend to highlight the words next to the three squares
25. This converts the chart to percentages. Remove the superfluous decimals either by
selecting Edit > Select Y Axis or by clicking on the Y icon.
29. Close the Properties window and check the chart. You will see that the order of the bars is
now the opposite of what it was (HE college was last but now is first). To rectify this either
select Edit > Select X Axis or click on the X icon.
► Further editing could be done to more closely mimic the LISU/SQWconsulting chart
shown at the beginning.
1. Load data file: File > Open > Data > DATA03_LSquestionnaire.sav (if not loaded).
4. Drag the „Simple Histogram‟ (first icon) from the Gallery into the Chart Preview box.
► Note that the Element Properties window which opens shows that the Statistic is set
to the default „Histogram‟ – this means that the y-axis will show frequencies. If you
change this to „Histogram Percent‟ then the y-axis will show percentage instead.
► As with Bar Charts, every element and aspect of this Histogram can be edited and
further elements can be added.
11. Hide the normal curve again using Elements > Hide Distribution Curve.
12. To alter the Histogram to have lines rather than bars first click on
the bars and then select Options > Un-bin Element.
18. Close the Chart Editor to embed the revised Histogram in the Viewer Output window
(see below).
► The data analysed here has a small range of values (the integers 0 to 12). Altering the
binning is more useful when there is a large range of possible values.
► If you were to enter „13‟ in Number of intervals then the result would essentially be a
Bar Chart as there are 13 possible values: 0, 1, 2, ...12. (Try it.)
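The effect of the Number of intervals setting can be sketched outside SPSS. The Python fragment below is only an illustration: the `bin_counts` helper and the module counts are hypothetical, and the bin edges (-0.5 to 12.5) are an assumption chosen so that each integer sits in the middle of its own bin when 13 intervals are used.

```python
from collections import Counter

def bin_counts(values, lo, hi, n_intervals):
    """Count values in n_intervals equal-width bins covering lo..hi."""
    width = (hi - lo) / n_intervals
    counts = [0] * n_intervals
    for v in values:
        # A value on the top edge is placed in the last bin.
        idx = min(int((v - lo) / width), n_intervals - 1)
        counts[idx] += 1
    return counts

# Hypothetical module counts (integers 0 to 12), not the tutorial's data.
modules = [0, 1, 1, 2, 3, 3, 3, 5, 6, 6, 8, 10, 12, 12]

coarse = bin_counts(modules, -0.5, 12.5, 4)   # four wide bins
fine = bin_counts(modules, -0.5, 12.5, 13)    # one bin per integer value

print(coarse)
print(fine)
```

With 13 intervals each bin holds exactly one integer value, so the counts match a plain frequency count of the values, which is why the result looks like a Bar Chart.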
We now turn briefly to another much less well-known form of the Histogram.
1. Load data file: File > Open > Data > DATA03_LSquestionnaire.sav (if not loaded).
3. Click Reset.
5. Drag the „Stacked Histogram‟ icon from the Gallery into the Chart Preview box.
6. In the Element Properties window which opens change the Statistic to „Histogram Percent‟.
7. Click Apply.
9. Locate ug_pg in the Variables list and drag it across to the Stack: set color box in the
Chart Preview box.
► This chart shows the contributions separately for undergraduates and postgraduates.
► Sometimes such a chart will reveal marked differences, but not in this case, with both
types of student having broadly the same shaped distribution.
► It is equivalent to the Stacked Bar Chart for nominal and ordinal data.
► Usually the data set would have more values than in this example.
1. Load data file: File > Open > Data > DATA03_LSquestionnaire.sav (if not loaded).
3. Click Reset.
5. Drag the „Frequency Polygon‟ icon from the Gallery into the Chart Preview box.
1. Load data file: File > Open > Data > DATA03_LSquestionnaire.sav (if not loaded).
4. Drag the „Population Pyramid‟ icon from the Gallery into the Chart Preview box.
5. Locate modules in the Variables list and drag it to the Distribution Variable? box.
6. Locate gender in the Variables list and drag it to the Split Variable? box.
7. Click OK.
► This is essentially the Stacked Histogram met in T16.2, broken into its two constituent
parts and set out sideways…
► Usually the data set would have more values than in this example.
► Below is a different looking Population Pyramid of Recommended Retail Price for the
UK‟s 100 top-selling books, obtained from the data file: DATA01_100Books.sav
provided to support this Guide.
4. Drag the „Pie Chart‟ icon from the Gallery into the Chart Preview box.
6. Locate miss_lectures in the Variables list and drag it across to the Slice by? box.
17. Select Text Style view and change Size from ‘Automatic’ to „12‟.
20. Select Text Style view and change Size from ‘Automatic’ to „14‟.
► Note that the default Sort by „Value‟ here means sort by the value of the codes used
for the five possible responses (1 = „Disagree strongly‟, etc.), not by size of slices.
28. It can sometimes be best to have the slices in descending order of size of slice (clockwise
starting from the „12 o'clock‟ position), although it‟s not appropriate here.
Purely for illustration, to achieve this proceed as follows: in Chart Editor select the Pie
Chart by clicking on it in the middle.
32. To remove all the data labels, in Chart Editor select Elements > Hide Data Labels.
34. To „explode‟ the whole Pie Chart, first select the Chart (either click on it in the middle or
use Edit > Select Chart) and then choose Elements > Explode Slice.
35. To put the Pie Chart back together, choose Elements > Return Slice.
► Alternatively, click again on the pie slice icon in the Elements Toolbar.
36. To „explode‟ one slice (e.g. the smallest), select the slice by clicking on it (ensuring it
alone has a pale yellow outline round it) and choose Elements > Explode Slice (or
click on the Toolbar icon).
► The above actions can produce effects such as those shown below:
► Use Edit > Options to check that in the General window the Variable Lists choices
are Display names and File, to match the list format used in this tutorial.
3. Click on Gallery (if not already highlighted in yellow) and click on Line.
4. Drag the „Simple Line‟ icon from the Gallery into the Chart Preview box.
14. Close the Chart Editor window to embed this Line Chart in the Viewer Output window.
We now refine the analysis to compare month of launch of Hardback and Paperback books.
We can do this by modifying the current Chart Builder contents rather than starting afresh.
16. Drag the „Multiple Line‟ icon from the Gallery into the Chart Preview box to replace the
„Simple Line‟.
17. In Chart Builder locate Binding in the Variables list (near the bottom) and drag it to the
Set color box within the Chart Preview box.
► This will make the number of Hardback books launched in a month into a percentage
of all Hardback books for the 12 months (and likewise for Paperbacks). This „evens
out‟ the difference in the total numbers of Hardback and Paperback books, to allow a
proper comparison.
23. Click OK to exit Chart Builder and embed the Line Chart in the Viewer Output window.
► This Multiple Line Chart (below) shows that there is a clear difference in the launch month
profiles of top-100 books for the two bindings.
► Questions to ask are: Is this a coincidence? Is it true more generally across all books?
Has there been a change over time? Is there a logical explanation?
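The conversion to percentages within each binding, which makes the two lines comparable, can be sketched as follows. This is an illustration only: the `within_group_percent` helper and the launch counts are hypothetical and are not taken from DATA01_100Books.sav.

```python
def within_group_percent(counts):
    """Convert {group: {category: count}} into percentages of each group's own total."""
    result = {}
    for group, by_category in counts.items():
        total = sum(by_category.values())
        result[group] = {cat: 100.0 * n / total for cat, n in by_category.items()}
    return result

# Hypothetical launch counts for three months only.
launches = {
    "Hardback": {"Sep": 6, "Oct": 9, "Nov": 5},
    "Paperback": {"Sep": 20, "Oct": 10, "Nov": 50},
}

pct = within_group_percent(launches)
for binding, months in pct.items():
    print(binding, months)
```

Although there are four times as many Paperbacks as Hardbacks in this made-up example, each binding's percentages sum to 100, so the monthly profiles can be compared directly.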
► Use Edit > Options to check that in the General window the Variable Lists choices
are Display names and File, to match the list format used in this tutorial.
4. Drag the „Simple Scatter‟ icon from the Gallery into the Chart Preview box.
10. Select Elements > Fit Line at Total to obtain the regression line („line of best fit‟).
R² is a measure of how much one variable „determines‟ or „predicts‟ the value of the other
variable – in this case it is very high at 86%.
11. Close the Chart Editor window to embed this Scatterplot in the Viewer Output window.
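The R² value shown with a fitted line can be reproduced by hand for a simple least-squares line. The sketch below is a minimal illustration: `fit_line` is a hypothetical helper and the five data points are invented, so the result will not match the 86% quoted for the books data.

```python
def fit_line(xs, ys):
    """Least-squares line y = intercept + slope * x, plus R squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R squared = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, 1 - ss_res / syy

# Hypothetical data points.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept, r2 = fit_line(xs, ys)
print(round(slope, 3), round(intercept, 3), round(r2, 4))
```

For a single predictor, R² is the square of the Pearson correlation between the two variables.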
We now refine this analysis to separate out the Hardback and Paperback books.
13. Drag the „Grouped Scatter‟ icon from the Gallery into the Chart Preview box.
14. Close the Element Properties window which will have opened.
15. Locate Binding in the Variables list and drag it across to the Set color box.
18. To get separate regression lines for the two bindings select
Elements > Fit Line at Subgroups.
► SPSS produces the two lines as required, and R² values for each.
23. In the Properties window select Marker view (if not selected) and change the Type from
„circle‟ to „square‟ and click in the Fill box and then click on „blue‟.
► Note that SPSS has also filled in the Paperback red circles which we did not ask for (a
minor bug!). To overcome this, proceed as follows:
25. Click on the red circle next to „Paperback‟ which will highlight all the red circles.
26. In the Properties window select Marker view (if not selected) and click in the Fill box and
then click on „white‟.
To edit the separate regression lines for the two bindings proceed as follows:
1. Load data file: File > Open > Data > DATA03_LSquestionnaire.sav (if not loaded).
4. Drag the „Simple Boxplot‟ icon from the Gallery into the Chart Preview box.
► The 1-D Boxplot would produce the same output as this does here.
► For a normal distribution the distance from Median to Inner fence is approximately
equivalent to 3 standard deviations, enclosing virtually all the values.
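The figure of roughly 3 standard deviations can be checked with a short calculation. The sketch below assumes the common convention that the inner fences lie 1.5 times the interquartile range beyond the quartiles (SPSS works from hinges, which are almost the same thing for large samples); it uses only Python's standard library.

```python
from statistics import NormalDist

z = NormalDist()          # standard normal: mean 0, standard deviation 1
q1 = z.inv_cdf(0.25)      # lower quartile
q3 = z.inv_cdf(0.75)      # upper quartile
iqr = q3 - q1

upper_fence = q3 + 1.5 * iqr   # distance above the median, in SD units
lower_fence = q1 - 1.5 * iqr
share_inside = z.cdf(upper_fence) - z.cdf(lower_fence)

print(round(upper_fence, 3), round(share_inside, 4))
```

Under this assumption the fence sits about 2.7 standard deviations from the median, and a little over 99% of a normal distribution falls between the two fences.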
1. Load data file: File > Open > Data > DATA03_LSquestionnaire.sav (if not loaded).
4. Drag the „Simple Boxplot‟ icon from the Gallery into the Chart Preview box.
6. Locate modules in the Variables list and drag it across to the Y-Axis? box.
7. Locate ug_pg in the Variables list and drag it across to the X-Axis? box.
► There are several outliers and the circle outlier 117 is right on the
zero line. We will change the origin to reveal it properly.
10. Select the Y-axis either by Edit > Select Y-Axis or by clicking here:
► It may seem quite unnecessary to edit the Major increment (which process removes
the tick from its Auto box) but if you don‟t do anything then Auto changes the „2‟ to
„2.5‟ when you click Apply, which spoils the neatness of the Y-axis scale. (However,
instead of editing the Major increment you could in this case just deselect Auto
instead.)
14. Close Chart Editor to embed the chart in the Viewer Output window (see next page).
More complex boxplots can be easily produced (as below), but this will not be pursued here.
► Use Edit > Options to check that in the General window the Variable Lists choices
are Display names and File, to match the list format used in this tutorial.
4. Click OK.
► This produces the Report below which gives the IT Piracy statistics broken down by
Region, showing means and standard deviations:
► If more variables are added to the Dependent List then the table is set out differently
with one column per variable, as below:
► If more variables are added to the Independent List then more tables are obtained.
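What the Report computes (a count, mean and standard deviation for each category of the independent variable) can be sketched as below. The `means_by_group` helper, the region names and the piracy rates are all hypothetical and are used only to show the calculation.

```python
from collections import defaultdict
from statistics import mean, stdev

def means_by_group(rows):
    """Return {group: (N, mean, sample standard deviation)} for (group, value) pairs."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    return {g: (len(vs), mean(vs), stdev(vs)) for g, vs in groups.items()}

# Hypothetical piracy rates (%) by region.
rows = [("Europe", 40), ("Europe", 50), ("Europe", 60), ("Asia", 70), ("Asia", 90)]

report = means_by_group(rows)
for region, (n, m, sd) in report.items():
    print(region, n, round(m, 2), round(sd, 2))
```

The sketch uses the sample standard deviation (divisor N - 1).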
► Use Edit > Options to check that in the General window the Variable Lists choices
are Display names and File, to match the list format used in this tutorial.
► The correlation obtained is reported as significant at the 1% level (p < 0.01). However,
the relationship is not very strong.
► The correlation is negative indicating that higher GDP is associated with lower IT Piracy
Rate.
6. Move the scale variables Piracy_Value_2008 into the Variables box, to join the other two
variables already there.
► The correlations obtained are all reported as significant at the 1% level (p< 0.01) shown by
** next to the value, or significant at the 5% level (p<0.05) shown by * next to the value.
► The correlation for GDP against Piracy Value is very high and positive.
► It is perhaps surprising at first that the correlation for Piracy Rate against Piracy Value
is negative. We will investigate this further ...
9. Move the variables Piracy_Rate_2008 and Piracy_Value_2008 into the Variables box.
► Now that the effect of GDP is removed we see that the (partial) correlation of Piracy
Rate against Piracy Value is positive.
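How a correlation can change sign once a third variable is controlled for can be seen from the standard first-order partial correlation formula. The sketch below is an illustration: `partial_corr` is a hypothetical helper and the three input correlations are invented values, not the SPSS output for the piracy data.

```python
from math import sqrt

def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of x and y with the linear effect of z removed from both."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical correlations: x = Piracy Rate, y = Piracy Value, z = GDP.
r_rate_value = -0.2   # raw correlation is negative
r_rate_gdp = -0.5
r_value_gdp = 0.9

r_partial = partial_corr(r_rate_value, r_rate_gdp, r_value_gdp)
print(round(r_partial, 3))
```

In this made-up case the raw correlation is negative, yet the partial correlation is positive, because GDP pulls the two variables in opposite directions.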
► The correlations obtained are similar to, but somewhat higher than, those for Pearson.
► Spearman does not have these conditions, and is a nonparametric statistic, and so is to
be preferred if the Pearson conditions are not met.
► In practice, the two correlations usually lead to the same broad conclusion.
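The difference between the two coefficients can be illustrated by computing Spearman's correlation as a Pearson correlation of ranks. This is a minimal sketch with hypothetical helper names (`ranks`, `pearson`, `spearman`) and invented data; tied values are given the average of their ranks.

```python
from math import sqrt

def ranks(values):
    """Rank values from 1 upwards, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def spearman(xs, ys):
    return pearson(ranks(xs), ranks(ys))

# Hypothetical data: y always rises with x, but not in a straight line.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 100]
print(round(pearson(xs, ys), 3), round(spearman(xs, ys), 3))
```

Because Spearman depends only on the ordering of the values, it reaches 1 for any strictly increasing relationship, whereas Pearson is pulled below 1 by the departure from a straight line.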
4. Click OK.
► This time we will use the various options buttons (top right of the Crosstabs window) to
investigate the differences further.
3. Click Continue.
4. Click on Cells and click the Expected tick-box in the Counts section and the Adjusted
standardized tick-box in the Residuals section.
5. Click Continue.
6. Click OK to obtain this augmented crosstabulation table in the Viewer Output window
below:
► The table above contains the actual Count information, as in the previous table, but also
includes the Expected Count – i.e. the frequencies which one would expect if there were
no difference between the genders (no „bias‟ one might say). Clearly some differences
will happen by chance and the „significance‟ of the size of each individual difference will
depend upon the size of the frequency.
► The Adjusted Residual signifies the level of importance; any value of the Adjusted
Residual outside +2 to – 2 is considered „significant‟. (N.B. Here „significant‟ is the
ordinary English usage of the word which is not the same as „statistically significant‟ as
used below.)
► The Chi-Square Tests table below indicates if overall the differences are
statistically significant by giving the probability of differences this large or larger occurring
by chance.
► The Chi-Square Tests table shows that in this case the probability of differences this large
or larger occurring by chance is 0.003. This is much smaller than the normal 0.05 criterion
level used (95% significance) so there is very strong evidence of a difference between the
genders.
► The „small print‟ beneath the Chi-Square Tests table is important. It shows that the
conditions of the Chi-square test have been met so the test is valid.
A Chi-square Test report may confirm the existence of a statistically significant association but it
will not indicate how strong this is or which cells are most deviant.
The Crosstabs table provides some evidence because the difference between a cell‟s Count
(aka Observed count) and its Expected count does show the deviation. But it can be hard to
analyse or interpret this. By adding an extra optional line – showing the Adjusted Residuals – we
can get some help.
Any Adjusted Residual which exceeds +2 or is less than –2 is an indicator of a marked deviation –
the sign indicates the direction. The bigger this is the more important a contribution it makes.
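The Expected counts and Adjusted Residuals can be sketched directly from a table of observed counts. The fragment below uses the usual textbook definition of the adjusted standardized residual, which is assumed here to correspond to the SPSS option of that name; the `expected_and_adjusted` helper and the 2 by 2 counts are hypothetical.

```python
from math import sqrt

def expected_and_adjusted(table):
    """Return (expected counts, adjusted residuals) for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected, adjusted = [], []
    for i, row in enumerate(table):
        exp_row, adj_row = [], []
        for j, observed in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            # Standard error of (observed - expected) under independence.
            se = sqrt(e * (1 - row_totals[i] / n) * (1 - col_totals[j] / n))
            exp_row.append(e)
            adj_row.append((observed - e) / se)
        expected.append(exp_row)
        adjusted.append(adj_row)
    return expected, adjusted

# Hypothetical counts: rows are genders, columns are two responses.
observed = [[30, 20], [10, 40]]
expected, adjusted = expected_and_adjusted(observed)
print(expected)
print([[round(a, 2) for a in row] for row in adjusted])
```

Every cell in this made-up table has an Adjusted Residual of about 4 in absolute value, well outside the range -2 to +2, and the signs show which cells are over- or under-represented.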
The Chi-square test allows us to determine whether or not there is a statistically significant
association between two variables such as in a Crosstabs table.
The test provides a significance value for the association as a probability (p) – it is called Asymp.
Sig. in SPSS Crosstabs output.
To be a significant association (and not a product of random chance) small values of p are
needed. Normal choices are:
95% level: p < 0.05; 99% level: p < 0.01; 99.9% level: p < 0.001
The choice of level depends on how confident you want to be before declaring the existence of an
association; 95% is commonly used. (It could be quite a weak association.)
If p is larger than the chosen significance level then the variables are said to be statistically
independent.
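As a rough check on what the Pearson Chi-Square row reports, the statistic and its p value can be computed by hand. The sketch below is self-contained and uses only the standard library; `chi2_sf` (a closed-form upper-tail probability for whole-number degrees of freedom) and `chi_square` are hypothetical helper names, and the 3 by 2 table is invented.

```python
from math import erfc, exp, pi, sqrt

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= (x / 2) / k
            total += term
        return exp(-x / 2) * total
    total = erfc(sqrt(x / 2))                  # the df = 1 part
    term = sqrt(2 * x / pi) * exp(-x / 2)      # first extra term for odd df >= 3
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return total

def chi_square(table):
    """Pearson chi-square statistic, degrees of freedom and p value for a table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            stat += (observed - e) ** 2 / e
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df, chi2_sf(stat, df)

# Hypothetical 3 by 2 table of counts.
stat, df, p = chi_square([[20, 30], [25, 25], [35, 15]])
print(round(stat, 3), df, round(p, 4))
```

Here p is below 0.01, so with these invented counts the association would be declared significant at the 99% level.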
The SPSS Crosstabs procedure generates a Chi-Square Test report, showing the significance
level:
For a table larger than 2 by 2 (i.e. more than 4 cells) look for the line:
For a 2 by 2 table (i.e. one with 4 cells) look for the line below that:
Chi-square is not valid if the Expected cell counts are too small – caused by having too few
cases or too many cells. The criteria which must be met to use Chi-square are:
SPSS automatically prints out the relevant information below the Chi-Square Tests report table.
Remember to check this before drawing any conclusions!
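The check that SPSS prints beneath the table can be sketched as follows, using the two criteria for tables larger than 2 by 2 quoted in this tutorial (no expected count below 1, and no more than 20% of cells with an expected count below 5). The helper names and the counts are hypothetical.

```python
def expected_counts(table):
    """Expected counts under independence for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def check_validity(table):
    """Return (valid?, minimum expected count, share of cells with expected count < 5)."""
    cells = [e for row in expected_counts(table) for e in row]
    share_small = sum(e < 5 for e in cells) / len(cells)
    valid = min(cells) >= 1 and share_small <= 0.20
    return valid, min(cells), share_small

# Hypothetical 4 by 2 table with a very sparse last category.
sparse = [[40, 35], [20, 25], [10, 14], [1, 5]]
print(check_validity(sparse))

# The same counts after merging the last two categories.
merged = [[40, 35], [20, 25], [11, 19]]
print(check_validity(merged))
```

In the sparse table 2 of the 8 cells (25%) have an expected count below 5, so the test would be flagged as invalid; merging the two smallest categories removes the problem.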
3. Click Reset.
6. Click OK.
7. Click Continue.
8. Click on Cells and click the Expected tick-box in the Counts section and the Adjusted
standardized tick-box in the Residuals section.
► The Pearson Chi-Square row shows the result to be significant at the 95% level since
p = 0.013 (i.e. p < 0.05) BUT there is a problem because the „small print‟ beneath the
table shows the test to be invalid, as it fails to meet the criteria for larger than 2 by 2
tables:
No more than 20% of cells may have an expected count less than 5 – here it is 25%.
► It is clear where the problem lies – only 1 entry for „Less than once a month‟.
► The solution is to combine categories together – in this case combine it with the previous
category to make a new category of „Less than twice a month‟.
► This requires using the Recode procedure which was explained in Reference Section 9.2 on
page 29.
10. It is important to be sure what the current codes are. Depending on the Output option
settings, the labels rather than the values will appear – as in the case here. To find the
codes, select Variable View and click on the three dots in the Values column for
variable for usage_level. This is what will appear:
► We must recode 4 → 3 (reducing the categories by one), and can keep the remaining
categories all the same. Do this as follows:
13. Within the Recode into Different Variables window, locate usage_level in the variable
list on the left and move it into the Input Variable → Output Variable box on the right.
Then type into the Name box the name for the new variable (usage_code).
Then type into the Label box a suitable label for the new variable (New usage categories).
16. To recode 4 → 3, enter „4‟ in the Value box in the Old Value section and enter „3‟ in the
Value box in the New Value section.
18. It is VERY IMPORTANT to tell SPSS to keep all the other values otherwise they will be lost.
Do this as follows:
Click on the All other values radio button and click on the Copy old value(s) radio button.
► The report, confirming the recoding, will appear in the Viewer Output window.
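The logic of this recode, including why the All other values / Copy old value(s) rule matters, can be sketched in a few lines. The `recode` helper and the list of codes are hypothetical; `None` stands in for a missing value.

```python
def recode(values, mapping, copy_others=True):
    """Recode values into a new list using old -> new pairs in mapping.

    With copy_others=True, values not in the mapping are copied unchanged;
    with copy_others=False they become None (i.e. they are lost).
    """
    return [mapping.get(v, v if copy_others else None) for v in values]

# Hypothetical usage_level codes for seven students.
usage_level = [1, 2, 3, 4, 2, 4, 1]

usage_code = recode(usage_level, {4: 3})
lost = recode(usage_level, {4: 3}, copy_others=False)

print(usage_code)   # every 4 has become 3; other codes are copied
print(lost)         # without the copy rule only the recoded 4s survive
```

The original list is left untouched, which mirrors recoding into a different variable rather than overwriting usage_level.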
► The attributes of the new recoded variable usage_code need attending to so that
they match those of the usage_level variable from which it has been derived. Do this as
follows:
23. Locate the row for usage_code (it will be the last variable) and change the Decimals to „0‟.
24. Copy the Values entry for usage_level into that for usage_code Values.
► It is now possible to repeat the Chi-square test using the new recoded variable.
► In the first table (below), 4 of the 6 Adjusted Residuals are larger than „2‟, indicating that
Males and Females differ markedly in reporting their level of regular usage.
► In the second table (below), the Pearson Chi-Square row shows the result to be
significant at the 99% level since p = 0.008 (i.e. p <0.01) and checking the „small print‟
beneath the table shows the test to be valid. It meets the criteria for larger than 2 by 2
tables:
No cell may have an expected count less than 1 – the minimum is 15.79.
No more than 20% of cells may have an expected count less than 5 – here it is 0%.
► We conclude that there is a significant difference in the responses of Males and Females to this
question i.e. there is an association between gender and „usage‟ (as measured here).
► There is still one question to ask: How strong is the association between the two?
► To answer this we turn to the third table which was generated because we ticked the Phi
and Cramér’s V box.
► The crosstabs table we have analysed is bigger than 2 by 2 (it was originally 4 by 2 and then
recoded to 3 by 2). Therefore the relevant test of association is Cramér‟s V. Its value is a
measure of strength of association. (Its Approx. Sig. is the same as for Pearson Chi-
square in the second table, and can be ignored.)
► Cramér‟s V varies between 0 and 1 with 0 indicating no association and 1 indicating perfect
association (very similar to correlation). Here the value is 0.255 – a moderate level.
Getting these two statistics is an option in Crosstabs, and produces a table entitled Symmetric
Measures.
These (normally) have absolute values between 0 and 1, with 1 indicating perfect association and
0 signifying no association whatsoever.
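Cramér's V can be obtained from the Chi-square statistic with the standard formula V = sqrt(chi-square / (N x (k - 1))), where k is the smaller of the number of rows and columns. The sketch below is an illustration with a hypothetical `cramers_v` helper and invented inputs, not the values from the tutorial's output.

```python
from math import sqrt

def cramers_v(chi_square, n, n_rows, n_cols):
    """Cramér's V for a table with n cases, n_rows rows and n_cols columns."""
    return sqrt(chi_square / (n * min(n_rows - 1, n_cols - 1)))

# Hypothetical values: chi-square of 9.375 from 150 cases in a 3 by 2 table.
v = cramers_v(9.375, 150, 3, 2)
print(round(v, 3))
```

For a 2 by 2 table the divisor is simply N, and V then has the same size as Phi.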
Strictly speaking Phi and Cramér’s V are designed for nominal (categorical) data, and other
statistics are more specifically designed for ordinal data, but few take notice of that!
In summary, Chi-square tells you whether the table values could be due to chance - and
the variables are independent – or, alternatively, that there is good evidence of some real
association between the two variables.
The larger N is, the smaller the level of association which can be detected. So Chi-
square can deliver a statistically significant verdict for an association which is too weak
to be of any practical significance!
We will use the following table for the analysis, and need to enter some of this data into SPSS.
Before entering any data, however, it is best to get to understand the table needed (and this is
not obvious!).
First we decide on numeric codes to represent Male/Female and Yes/No. This is arbitrary, and
we will use:
There are four cells in the above table which contain the essential information:
Using the chosen codes (values) instead of the labels this becomes:
This is the set of numbers to be entered, each cell being one case. The order does not matter
but it is best to be systematic.
The column above containing the frequencies will be used as weights (to be explained).
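The idea behind weighting can be sketched as follows: each row of the small data file stands for as many identical cases as its frequency says. The codes follow the scheme chosen above (1 = Male, 2 = Female; 1 = Yes, 0 = No), but the frequencies and the helper names are hypothetical.

```python
from collections import Counter

# Hypothetical rows: (GENDER, CWK, FREQ).
rows = [(1, 1, 12), (1, 0, 8), (2, 1, 13), (2, 0, 12)]

def weighted_crosstab(rows):
    """Cell counts obtained by using FREQ as a weight for each row."""
    table = Counter()
    for gender, cwk, freq in rows:
        table[(gender, cwk)] += freq
    return table

def expand_cases(rows):
    """The equivalent unweighted data: one (GENDER, CWK) case per student."""
    return [(gender, cwk) for gender, cwk, freq in rows for _ in range(freq)]

cases = expand_cases(rows)
print(len(cases))
print(weighted_crosstab(rows) == Counter(cases))
```

Weighting four rows therefore gives the same crosstabulation as typing in one row per student, which is why only the four cells need to be entered.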
3. In row 1: for Name enter „GENDER‟, for Decimals enter „0‟, for Label enter ‘Gender of
student‟.
4. In row 1: open the Value Labels window and enter „1‟ for
„Male‟ and „2‟ for „Female‟.
7. In row 2: open the Value Labels window and enter „1‟ for
„Yes‟ and „0‟ for „No‟.
9. In row 3: for Name enter „FREQ‟, for Decimals enter „0‟, for Label enter ‘Weight‟ and for
Measure choose „Scale‟
► However, if the View menu option Value Labels has been selected then the labels
rather than the codes will be shown (look in the View menu to change this if you wish).
20. Locate the variable CWK and move it into the Column(s) box.
► The crosstabs table below is essentially the same as that given at the start of this
TUTORIAL. The difference is that here „No‟ comes before „Yes‟ because the chosen code
for ‟No‟ is numerically less than that for „Yes‟.
► The footnote „a‟ shows that the test satisfies the criteria for validity.
► The footnote „b‟ alerts us to the fact that because it is a 2 by 2 table an extra row called
Continuity Correction has been included. In such cases it is this row which provides the
correct significance level (Asymp. Sig.), not the first row (Pearson Chi-Square). The
value is p = 0.92 which is not statistically significant because p > 0.05.
24. VERY IMPORTANT Having finished this analysis, which used weighted data, it is
essential to turn off the weighting otherwise it could cause false results in future analyses.
Do this as follows:
26. Click on the Do not weight cases radio button. (This step is essential, whereas step 25 is
optional but done for completeness.)
► The „Weight on‟ advisory message will disappear from the bottom of the Data Editor
window.
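The Continuity Correction row mentioned in footnote „b‟ can be sketched for a 2 by 2 table using the usual Yates formula, which is assumed here to be what SPSS reports. The `chi_square_2x2` helper and the four counts are hypothetical.

```python
from math import erfc, sqrt

def chi_square_2x2(a, b, c, d, continuity=True):
    """Chi-square and p value (1 degree of freedom) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if continuity:
        # Yates' correction: shrink the difference by n/2, but not below zero.
        diff = max(0.0, diff - n / 2)
    stat = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = erfc(sqrt(stat / 2))   # upper tail of chi-square with 1 degree of freedom
    return stat, p

# Hypothetical counts.
stat_plain, p_plain = chi_square_2x2(12, 8, 13, 12, continuity=False)
stat_corr, p_corr = chi_square_2x2(12, 8, 13, 12, continuity=True)
print(round(stat_plain, 4), round(p_plain, 3))
print(round(stat_corr, 4), round(p_corr, 3))
```

The correction always reduces the statistic and so increases p, making the test more cautious for small 2 by 2 tables.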
This test analyses a variable and compares its value frequencies with some predetermined
frequencies to see if the proportions are more-or-less the same, or differ significantly.
For example, the ACORN profile system for SOCIAL CLASS developed by CACI Ltd has the
following broad classes:
[See details for this classification system and recent results – per businessballs.co.uk - in the
final section of the Appendix to this Guide.]
The question we address is: „How does the university student population compare?‟
In the data file DATA03_LSquestionnaire.sav are the responses by 150 students to the
question named usage_level:
Firstly, we test to see if the numbers of males and females are about equal or differ significantly.
► Use Edit > Options to check that in the General window the Variable Lists choices
are Display names and File, to match the list format used in this tutorial.
4. Check that the Expected Values radio button for All categories equal is selected.
5. Click OK.
Secondly, we test to see if the distribution of responses to the question on VLE usage is uniform
(that would be about 35 to 40 for each answer – very unlikely in this case of course).
9. Check that the radio button for All categories equal is selected (for Expected Values).
► The results in the Viewer Output window show that the proportions are very unevenly
distributed (with Expected 37.5 for each) and Asymp. Sig. is 0.000, so p < 0.0005.
► As p < 0.05, the result is statistically significant at the standard 95% level (in fact,
because here p < 0.001, the result is statistically significant at the 99.9% level).
which is 0.00000000000016033822461212044.
This continues on using the dataset from T26.2. We test to see if the distribution of responses to the
question on VLE usage is similar to the results obtained the previous year when there were 132
student responses, as shown in the table below:
5. Click Add.
7. Click OK.
► The Observed N are the frequencies for the four possible values of usage_level.
► The Expected N are the frequencies for the values of usage_level if the same
proportions as last year had occurred again.
► These Expected N are not the four numbers typed in because they
have been scaled up (each one multiplied by 150/132) so their total is also 150.
► As p < 0.05 the result is statistically significant at the 95% level (but not at the 99% or
99.9% levels).
► Interestingly, if the „7‟ were „6‟ instead then the result would NOT be statistically
significant at the 95% level (try it and see).
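The scaling of the Expected N and the resulting Chi-square value can be sketched as below. The helper names and both sets of counts are hypothetical (they are not last year's actual figures); they are chosen only so that the reference counts total 132 and the observed counts total 150.

```python
def scaled_expected(observed, reference):
    """Scale reference counts so that they sum to the observed total."""
    factor = sum(observed) / sum(reference)
    return [r * factor for r in reference]

def goodness_of_fit(observed, expected):
    """Chi-square goodness-of-fit statistic and its degrees of freedom."""
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1

# Hypothetical counts for the four values of usage_level.
last_year = [66, 33, 22, 11]    # total 132
this_year = [80, 40, 20, 10]    # total 150

expected = scaled_expected(this_year, last_year)
stat, df = goodness_of_fit(this_year, expected)
print(expected, round(stat, 3), df)
```

With 3 degrees of freedom the 5% critical value is 7.815, so these invented counts would not differ significantly from last year's proportions. Setting every reference count to the same number reproduces the All categories equal case.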
(a) Comparing two sample means from two different groups to infer if the populations from
which they came differ.
This is the Independent Samples t Test.
(b) Comparing the sample mean taken from one group with some specified mean value to
infer if the population mean differs from that specified.
This is the One Sample t Test.
(c) Comparing two sample means from one group under two different circumstances
(„treatments‟ or „conditions‟) to infer if there is an underlying difference.
This is the Paired Samples t Test.
The Independent Samples t Test and the Paired Samples t Test are very commonly used.
The samples are random and independently selected from the parent population(s).
The parent population(s) have equal variances (for the Independent Samples Test).
The data is scale (i.e. interval or ratio) from a continuous normal distribution.
Within the SPSS Independent Samples Test there is a statistical test to determine
whether the amount of deviation from equality of variance is acceptable or not.
A test of normality can be done „by eye‟ using the Frequencies procedure to draw a
Histogram with a normal curve superimposed.
A statistical test of normality can be carried out using the Kolmogorov-Smirnov Test
(covered in TUTORIAL T30).
Warning Note:
As was said above, the t test looks at samples to infer information about populations from
which the samples are drawn. So, strictly speaking, if you have a class of students and test them
on their verbal ability to see if overall the females out-perform the males then no t test is needed.
Just look at the means!
However, if you consider that your students are a representative sample of a (well-defined)
wider population then a t test could be applied, as there is an inference to make about the
wider population.
In this TUTORIAL, to keep things simple, we are not drawing a clear distinction between sample
and population. We are concentrating on the SPSS procedure.
► Use Edit Options to check that in the General window the Variable Lists choices
are Display names and File, to match the list format used in this tutorial.
7. Click Continue.
► The codes for gender now appear in the Grouping Variable box:
8. Click OK.
► The first table gives basic descriptive statistics. It shows the means differ, but not by
much. It shows the two Std. Deviation values (which are the square roots of the two
variances) to be almost the same.
► The t value (-0.224) has a Sig. (2-tailed) value of 0.823 (what we call p). So in this
case p > 0.05, and the difference in means is NOT statistically significant.
► The 95% Confidence interval shows the likely range of values for the difference in
means. Notice that the range includes zero, showing that no difference is a real
possibility. If the 95% Confidence Interval did not include zero then there would be a
statistically significant difference.
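The link between the t value, the Sig. value and the confidence interval can be illustrated with a small Python sketch. It assumes equal variances (the pooled-variance form of the test) and uses two hypothetical samples of six scores each, not the tutorial's data. The critical value 2.228 is the standard two-tailed 5% t table value for 10 degrees of freedom; the function name is invented for the example.

```python
# Independent-samples t test with pooled variance (equal variances assumed).
# The two samples are HYPOTHETICAL scores, not the tutorial's data.
from math import sqrt
from statistics import mean, variance

def independent_t(sample_a, sample_b):
    """Return (mean difference, standard error, t, df) using the pooled variance."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / df
    std_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    diff = mean(sample_a) - mean(sample_b)
    return diff, std_error, diff / std_error, df

group_1 = [4, 5, 6, 5, 4, 6]
group_2 = [5, 5, 6, 4, 5, 6]

diff, std_error, t_value, df = independent_t(group_1, group_2)

T_CRITICAL_DF10 = 2.228  # two-tailed 5% t table value for df = 10
ci_low = diff - T_CRITICAL_DF10 * std_error
ci_high = diff + T_CRITICAL_DF10 * std_error

print("Mean difference =", round(diff, 3), " t =", round(t_value, 3), " df =", df)
print("95% CI for the difference:", (round(ci_low, 3), round(ci_high, 3)))
print("Interval includes zero:", ci_low < 0 < ci_high)
```

With these made-up scores the interval straddles zero and |t| is below the critical value, which are two ways of stating the same non-significant result.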
1. Load data file: File Open Data DATA03_LSquestionnaire.sav (if not open)
5. Click OK.
► The first table gives basic statistics. It shows the mean is 6.32 which is close to 7.
► The second table provides a Sig. (2-tailed) value of 0.001 which is statistically
significant (p < 0.05). If this were a sample then the conclusion would be that the
population mean was not 7.
► Notice that here the 95% Confidence Interval does NOT include zero for the difference
between the means.
Here we test if the Average Selling Price (ASP) of the Top-selling 100 books is significantly different
from the Recommended Retail Price (RRP).
► The Paired-Samples T Test window opens
5. Click OK.
► The first table shows the pairs of variables being compared and their means etc. It can
be seen that the Means are quite different and Std. Deviations are very different.
► The second table is an unexpected bonus! We did not ask for the correlation but we
got it all the same. We can see that the variables have an extremely high positive
correlation.
► Its Sig. of 0.000 means p < 0.0005 so the correlation „definitely‟ isn‟t zero, (or rather,
expressing it more precisely, if this were a sample then the parent population‟s
correlation couldn‟t be zero).
► The third table gives the results. The important value is in the Sig. (2-tailed) column.
► The means of ASP and RRP are „definitely‟ different. In the table Sig. (2-tailed) =
0.000 which means p < 0.0005 so p < 0.001 which gives 99.9% significance level.
► Double-clicking on the table to activate it (for editing) and then double-clicking on the
Here we will look at some ordinal data derived from responses to four questions about a university‟s
VLE, each answered on a 5-point scale. We should really only use scale data but it illustrates the
procedure well.
1. Load data file: File Open Data DATA03_LSquestionnaire.sav (if not open).
► We enter a second pair to test at the same time.
7. Click OK.
► The first table produced (shown below) displays the pairs of questions being compared
and their means.
► It can be seen that for the first pair the Means and Std. Deviations look very similar, but
not so for the second pair.
► The second table provides correlations. We can see that the first pair have a
moderately strong positive correlation (0.495). Its Sig. of 0.000 implies p < 0.0005
which means it 'definitely' isn't from a population with a zero correlation. In contrast the
second pair have a negligible correlation – which may well be considered zero.
► From the Pair 1 correlation (+0.495, Sig. = 0.000) we conclude that students who said
the VLE was easy to use also generally said that the VLE contained useful information.
In contrast…
► From the Pair 2 correlation (–0.051, Sig. = 0.537) we conclude that students who said
their studies would suffer without the VLE were not generally the same as those who
said they did not mind missing lectures (N.B. there is no relationship either way as the
correlation is so low).
► The third table gives the t Test results – the important values are in the Sig. (2-tailed)
column.
► Here the stronger the agreement with the statement the bigger the number.
► Also important with the t Test is to be clear what the sign of the difference of means
signifies, i.e. which number is being subtracted from which.
► For Pair 1 the calculation is: 'easy' – 'useful', which is 4.03 – 4.13, which is negative.
► The means of the first pair are not found to be significantly different as the difference is
small (–0.093)
(Sig. = p = 0.127 so p > 0.05).
► The means of the second pair are 'definitely' different as the difference is large (+0.967)
(Sig. = p = 0.000 means p < 0.0005 so p < 0.001).
► From the Pair 1 t Test (Sig. = 0.127) we conclude that students on average rated the
VLE's ease of use about the same as its having useful information (the difference in
average rating is very small: i.e. only –0.093 on a 5-point scale).
► From the Pair 2 t Test (Sig. = 0.000) we conclude that students on average rated suffering
without the VLE higher than not minding missing lectures (the difference in average rating
is quite large: i.e. +0.967 on a 5-point scale).
► Warning note: This particular example illustrates the problems with having NEGATIVE
statements like „Don‟t mind …‟ – the results can be quite hard to interpret as you may find!
Avoid negative statements in questionnaires if you can as they often lead to „double
negatives‟ when interpreting.
► Finally, the responses were compared in pairs. It would be nice, say, to compare all four
together. That is not possible with a t Test … but that kind of analysis can be done using
One-Way Analysis of Variance (ANOVA), which is the subject of T29.
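The paired comparison used in this tutorial, including the sign convention (first variable minus second), can be sketched in a few lines of Python. The ratings below are hypothetical 5-point responses from eight students, chosen only to show the calculation; the variable and function names are illustrative.

```python
# Paired-samples t test: work on the per-student differences (first minus second).
# The ratings are HYPOTHETICAL 5-point responses, not the tutorial's data.
from math import sqrt
from statistics import mean, stdev

def paired_t(first, second):
    """Return (mean difference, t, df) for paired observations."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean_diff = mean(diffs)
    std_error = stdev(diffs) / sqrt(n)
    return mean_diff, mean_diff / std_error, n - 1

easy = [4, 5, 3, 4, 4, 5, 4, 3]     # hypothetical "easy to use" ratings
useful = [4, 5, 4, 4, 5, 5, 4, 4]   # hypothetical "useful information" ratings

mean_diff, t_value, df = paired_t(easy, useful)
print("Mean difference (easy - useful) =", mean_diff)
print("t =", round(t_value, 3), "with df =", df)
```

A negative mean difference here means the second variable was rated higher on average; swapping the arguments flips the sign of both the difference and t but leaves the Sig. value unchanged.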
► Use Edit Options to check that in the General window the Variable Lists choices
are Display names and File, to match the list format used in this tutorial.
4. Move the scale variable modules into the Test Fields box.
8. Click Run.
► The result is that the difference is not significant (p = 0.625) so the Null Hypothesis is to be
retained, i.e. there is no evidence of a difference in the number of modules males and
females choose.
► This is the same conclusion as was reached using the Independent Samples t Test in T27.2.
► If you do not perform step 6 (to enter Settings and choose which test to use) then SPSS will
make the decision for itself. It may use the Kruskal-Wallis One-Way ANOVA test (see T29.2)
which gives exactly the same result.
This test requires the variables to be scale. It will work if ordinal variables are reassigned to scale,
where appropriate.
Here we test if the Average Selling Price (ASP) of the Top-selling 100 books is significantly different
from the Recommended Retail Price (RRP). This was done in T27.4 using a t Test.
► Use Edit Options to check that in the General window the Variable Lists choices
are Display names and File, to match the list format used in this tutorial.
3. Click on Fields.
4. Move the scale variable RRP into the Test Fields box.
5. Move the scale variable ASP into the Test Fields box.
6. Click Run.
► The result is that the difference is highly significant (Sig. = 0.000 so p < 0.001) so the Null
Hypothesis is to be rejected.
► This is the same conclusion as was reached using the Paired Samples t Test in T27.4.
Analysis of Variance (ANOVA) and its more complex forms – ANCOVA, MANOVA, and MANCOVA –
provide a very powerful set of methods for comparing sample means to see if there is evidence to infer
that the underlying populations from which they are derived are different. However, these methods can
be complex both conceptually and procedurally. This Guide only introduces some of the more basic
methods and cannot do justice to the underlying statistics. A good statistics textbook or SPSS textbook
is essential.
ANOVA is a procedure for comparing sample means for one dependent variable (scale data – e.g.
statistics exam mark) for one or more independent variables (categorical data, also known as nominal
data – e.g. gender) to see if there is a statistically significant difference from which one can infer that
the populations from which these samples came are themselves different.
ANOVA is called a univariate method because it has one dependent variable (e.g. overall exam mark).
The dependent variable must always be scale (= interval or ratio). [Other criteria are that the underlying
populations from which samples are drawn should be normal, variances equal, sampling random.]
It is one-way ANOVA if there is just one independent variable (e.g. GCSE English grade). It is two-way
ANOVA if there are two independent variables (e.g. gender and racial group).
The independent variable must always be categorical (= nominal). It may take just two values (e.g.
male or female) or several (e.g. racial group defined as caucasian / black / asian / hispanic etc.).
The t test is the special case of one-way ANOVA when the independent variable takes only two values.
MANOVA – Multivariate Analysis of Variance – is an extension of ANOVA when there is more than one
dependent variable.
Independent measures: when two or more groups of subjects undergo exactly the same „experience‟
– e.g. male and female students take a calculus exam. Here gender is the independent variable – also
called a „factor‟ (having two levels). ANOVA can test whether in general males and females would have
the same level of performance in the exam. This is referred to as „between-subjects‟ as it looks at the
differences found between different groups of subjects.
Repeated measures: when each subject experiences more than one level of a factor – e.g. all
students on a module take a test on a topic both before and after doing a practical on that topic.
ANOVA can test whether in general the practical would have an effect upon test performance. This is
referred to as „within-subjects‟ as it looks at the differences found within individual subjects‟
performances „before‟ and „after‟. [Another example would be students on a programme all studying
the same six modules, in which case ANOVA could test for differences in the module results.]
Here we investigate the amount to which students used their department‟s VLE for support. The
students were on one of these six programmes:
The question asked is: “Does the number of modules for which a student used the VLE for support
vary significantly from programme to programme?”
Note: For this first introduction to ANOVA we take the simpler approach:
1. Load data file: File Open Data DATA03_LSquestionnaire.sav (if not open)
► The Post Hoc option produces a comparison of all possible pairs of factors.
► The term post hoc literally means „after the fact‟ and signifies that no decision is made as to
what to compare before the analysis takes place. The alternative approach – deciding
beforehand – is known as „Contrasts‟ (see T29.3).
6. Click Continue.
8. Click Continue.
► Table 1 (Descriptives) displays the statistics Count, Mean, Std Deviation, etc. for the
modules variable, broken down by the six categories of the prog variable.
► The Descriptives information is a good first place to look for similarities and differences of
the two most important statistics – Mean and Standard Deviation (square root of Variance).
► The Sig. value of 0.459 (i.e. p = 0.459) means p > 0.05 so there is no problem here –
equality of variances can be assumed.
► Table 3 (ANOVA) has the most important result, the ANOVA F value and its Sig. value.
► In this case Sig. = 0.000 (so p < 0.001) and the result is significant at the 99.9% level.
There is very strong evidence that the (population) means are not all the same – i.e.
there is a lot of variation between the groups (programmes).
► For those interested, the F statistic is the ratio of the Between Groups Mean Square
to the Within Groups Mean Square (each Mean Square is the Sum of Squares divided by
its df). So its magnitude measures whether most variation is between different groups
or between individuals within the groups.
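How the entries of an ANOVA table lead to F can be shown with a short Python sketch. The ANOVA table reports a Sum of Squares, a df and a Mean Square (Sum of Squares divided by df) for each row, and F is the Between Groups Mean Square divided by the Within Groups Mean Square. The three small groups below are hypothetical and the function name is invented for the example.

```python
# One-way ANOVA by hand: sums of squares, mean squares and F.
# The three groups are HYPOTHETICAL values chosen to keep the arithmetic simple.
from statistics import mean

def one_way_anova(groups):
    """Return (SS between, SS within, MS between, MS within, F)."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ss_between, ss_within, ms_between, ms_within, ms_between / ms_within

groups = [[2, 3, 4], [4, 5, 6], [7, 8, 9]]
ss_between, ss_within, ms_between, ms_within, f_value = one_way_anova(groups)

print("SS between =", round(ss_between, 3), " SS within =", round(ss_within, 3))
print("MS between =", round(ms_between, 3), " MS within =", round(ms_within, 3))
print("F =", round(f_value, 3))
```

A large F arises when the group means are far apart relative to the spread of individuals inside each group.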
► Table 4 (Multiple Comparisons) is very large because it compares every one of the
categories with all the others (twice actually!). The Bonferroni test (there are many
others – consult a statistics textbook and take your pick) indicates which pairs differ
significantly – i.e. where Sig. is less than 0.05. An asterisk against the Mean
Difference value highlights them. Bonferroni is a conservative test. Some prefer LSD.
► An asterisk against the Mean Difference indicates a significant pair – there are four
such pairs here, each reported twice.
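The size of the Multiple Comparisons table, and the idea behind the Bonferroni adjustment, can be sketched in Python. The programme codes are those used in this Guide; the p-values are hypothetical, and the simple rule shown (multiply each unadjusted p by the number of comparisons, capped at 1) is the textbook form of the Bonferroni correction rather than a reproduction of SPSS output.

```python
# Counting pairwise comparisons and applying a Bonferroni adjustment.
from itertools import combinations

programmes = ["LS", "IM", "PB", "ILM", "IKM", "EPB"]
pairs = list(combinations(programmes, 2))
n_pairs = len(pairs)          # distinct pairs of programmes
n_table_rows = 2 * n_pairs    # each pair is listed twice (A vs B and B vs A)

def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of comparisons, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

unadjusted = [0.001, 0.02, 0.04]          # HYPOTHETICAL unadjusted p-values
adjusted = bonferroni_adjust(unadjusted)  # three comparisons, so each is tripled

print("Distinct pairs:", n_pairs, " Rows in the table:", n_table_rows)
print("Adjusted p-values:", [round(p, 3) for p in adjusted])
```

The adjustment makes it harder for any single pair to reach significance, which is why Bonferroni is described as conservative.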
► The fifth output is a simple graph – Means Plots – which is a line chart of all the
categories‟ means. This may not seem a very appropriate choice of chart, but that‟s
what SPSS provides for ANOVA.
Here we repeat the analysis in T29.2 but this time do not use Post Hoc (after the fact) comparisons but
instead use Contrasts (comparisons decided in advance). The six programmes in this analysis are:
In T29.2 the basic question was “Does the number of modules for which a student used the VLE for
support vary significantly from programme to programme?” The Post Hoc test compared all possible
pairs of programmes and produced answers to this question. No decision as to which pairs to look at was
made until after the analysis.
Here we will ask in advance two questions, which the Post Hoc method did not (could not) answer.
Q1: “Is Publishing different from the two other undergraduate programmes taken together?”
For a contrast we have to assign weights: the same positive integer for each member of one group, and
the same negative integer for each member of the contrasting group. The potentially tricky part is that the
sum of all these weights must be zero.
For Q1 we can choose the following simple weights: Group 1: LS: +1, IM: +1 Group 2: PB: –2. We
do not want to involve the other three programmes at all, so assign them zero weight. We have:
Q1
Code UG or PG Programme name Weight
1 UG LS – Library Studies 1
2 UG IM – Information Management 1
3 UG PB – Publishing –2
4 PG ILM – Information & Library Management 0
5 PG IKM – Information & Knowledge Management 0
6 PG EPB – Electronic Publishing 0
Q2: “Are the three undergraduate programmes taken together different from the two postgraduate
programmes (excluding EPB) taken together?” [We exclude EPB as it has only 5 students.]
We want to contrast LS+IM+PB with ILM+IKM so deciding the weights here is less obvious than in Q1.
For Q2 we can choose these weights: Group 1: LS: +2, IM: +2, PB: +2 Group 2: ILM: –3, IKM: –3.
Q2
Code UG or PG Programme name Weight
1 UG LS – Library Studies 2
2 UG IM – Information Management 2
3 UG PB – Publishing 2
4 PG ILM – Information & Library Management –3
5 PG IKM – Information & Knowledge Management –3
6 PG EPB – Electronic Publishing 0
These are the simplest weight choices for Q1 and Q2, but there are infinitely many equivalent possibilities.
Having decided in advance on the contrasts, we now proceed with the analysis.
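The rules for contrast weights can be checked with a short Python sketch. The weights are the Q1 and Q2 weights given above, listed in programme code order (1 to 6); the group means are hypothetical and serve only to show how a contrast value is formed. The function names are illustrative.

```python
# Checking contrast weights: they must sum to zero, in programme code order 1 to 6.
programmes = ["LS", "IM", "PB", "ILM", "IKM", "EPB"]
q1_weights = [1, 1, -2, 0, 0, 0]     # LS and IM together versus PB
q2_weights = [2, 2, 2, -3, -3, 0]    # the three UG programmes versus ILM and IKM

def is_valid_contrast(weights):
    """A contrast is valid when its weights sum to zero."""
    return sum(weights) == 0

def contrast_value(weights, group_means):
    """Weighted sum of the group means; zero would mean 'no difference'."""
    return sum(w * m for w, m in zip(weights, group_means))

# Multiplying every weight by the same number gives an equivalent contrast.
q1_scaled = [3 * w for w in q1_weights]

hypothetical_means = [5.0, 6.0, 4.0, 7.0, 8.0, 6.5]  # HYPOTHETICAL programme means

for name, weights in [("Q1", q1_weights), ("Q1 scaled", q1_scaled), ("Q2", q2_weights)]:
    print(name, "valid:", is_valid_contrast(weights),
          " value:", contrast_value(weights, hypothetical_means))
```

The scaled version of Q1 changes the contrast value by the same factor but tests exactly the same question, which is why there are infinitely many equivalent weight choices.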
1. Load data file: File Open Data DATA03_LSquestionnaire.sav (if not open)
3. Click Reset and move the scale variable modules into the Dependent List box.
► Note that the association of the weights with the programmes is done by the order in
which the weights are entered. This must correspond with the numeric codes (1 to 6)
assigned to the programmes, as shown in the Q1 table on the previous page.
15. Click the Options button and select Homogeneity of variance test.
► Four tables are produced. The first is an ANOVA table (not shown) which we ignore here.
► Table 1 (Test of Homogeneity of Variances) reports Levene‟s Test result. This determines
which line of the Contrast Tests table we read (shown later) to find the significance.
► In this case Levene‟s Test result is not significant as Sig. > 0.05, so equality of variances
can be assumed.
► Table 3 (Contrasts Coefficients) presents the weights used in each Contrast. It is a good
idea to confirm that these are what you wanted!
► Table 4 (Contrast Tests) is the most important and presents the results for each Contrast.
► We can assume equal variances here, so read from the top two lines.
If the criteria for the normal (parametric) ANOVA are seriously violated then a nonparametric
version should be used. For the One-Way between-subjects ANOVA SPSS supplies the
Kruskal-Wallis One-Way ANOVA. As a demonstration, we repeat the analysis just carried
out in T29.2 and T29.3.
1. Load data file: File Open Data DATA03_LSquestionnaire.sav (if not open).
► Use Edit Options to check that in the General window the Variable Lists choices
are Display names and File, to match the list format used in this tutorial.
6. Select Run.
► Alternatively, before clicking on Run you could click on Customize tests and explicitly select
Kruskal-Wallis 1-way ANOVA (k samples).
► The conclusion is that there is a significant difference. Sig. = 0.000, so p < 0.001 and the
significance level is actually much greater than the default 95% (i.e. 99.9%).
Here we investigate whether students‟ marks are significantly different across four different modules.
1. Load data file: File Open Data DATA03_LSquestionnaire.sav (if not open)
► Use Edit Options to check that in the General window the Variable Lists choices
are Display names and File, to match the list format used in this tutorial. Remember to
click Apply before clicking OK.
► You can use the blue „up‟ and „down‟ arrows to alter the order of the variables. The order can
matter as they are referred to by number later on …
9. You could now click Plots and move „marks‟ into the Horizontal Axis box, and click Add to
insert „marks‟ into the Plots box to eventually produce a simple line plot showing the four means
(similar to what was done in T29.2), but we will not do so here.
11. Move „marks‟ into the Display Means for box. [This will generate Table 8.]
12. Select Compare main effects. [This will generate Table 9.]
13. Change the Confidence interval adjustment method to Bonferroni. [An option for Table 9.]
► This will later produce a table showing the means, sds and counts of the four variables.
► Note that the Significance level is set at 0.05, so the Confidence intervals are 95%.
► Module D looks very different, having a much lower mean and much greater variability.
► N = 149 is one less than the expected 150 because one case has some missing values.
► Provided Mauchly’s Test result is not significant (i.e. provided Sig. > 0.05) then sphericity
can be assumed and the top line of the next table is used, otherwise a lower line is used.
► Table 4 (above) shows in this case that Sig. = 0.000 which is highly significant. This is
because Module D has a very different standard deviation from the other three modules – its
variation is much greater. So in this case sphericity cannot be assumed.
► We know from Table 4 (Mauchly’s Test) that sphericity cannot be assumed, so we cannot
use the top line of Table 5 (below). Instead we use the second line – the Greenhouse-
Geisser line (we could choose Huynh-Feldt or Lower-bound but Greenhouse-Geisser is
the most popular test).
► The Greenhouse-Geisser line has Sig. = 0.000 so it is highly significant (p < 0.0005). We
conclude that there is definitely a within-subjects effect – i.e. there is a significant difference
between the module marks. However, it does not indicate where the main differences lie –
that comes later in Tables 8 and 9.
► You may have noticed that in Table 5 (above) all four lines have Sig. = 0.000, and may
therefore wonder what all the fuss was about! Well that‟s real statistics – being cautious in
coming to conclusions.
► Tables 8 and 9 are optional. They are useful for revealing where the main differences lie.
► Table 8 (Estimates) provides data on the four modules‟ marks. The last line (corresponding
to Module D) is very different from the others:
Its Mean is much lower, showing that the marks are mostly lower.
Its Std. Error is much larger, showing that it has a lot more variability.
Its 95% Confidence Interval is much lower (and wider), which is a consequence of the
above two.
1 = Module A
2 = Module B
3 = Module C
4 = Module D
► Of course, since 1 differs from 4 then 4 differs from 1, and so on, which is why the results
appear twice, but they are only highlighted once above.
► Note that for the significant rows (highlighted) the Confidence Interval does not include zero
as a possibility.
► Note that for the non-significant rows the Confidence Interval does include zero as a
possibility.
If the criteria for the normal (parametric) ANOVA are seriously violated then a nonparametric
version should be used. For the One-Way within-subjects ANOVA SPSS supplies
Friedman’s ANOVA method. As a demonstration, we repeat the analysis just carried out in
T29.5.
1. Load data file: File Open Data DATA03_LSquestionnaire.sav (if not open).
► Use Edit Options to check that in the General window the Variable Lists choices
are Display names and File, to match the list format used in this tutorial.
4. Move Marks_modA, Marks_modB, Marks_modC, Marks_modD into the Test Fields box.
5. Select Run.
► SPSS automatically determines which test to apply.
6. The result shown below is to reject the null hypothesis (that the mean marks on the four
modules are all the same). This is the same conclusion as was reached in T29.5 but with
a lot less effort (but this is a test with less power – a concept not discussed in this Guide).
► Note: This is really a one-way test but just to confuse you it is called a 2-way test because in
a within-subjects ANOVA the participants can be considered to constitute a factor.
Here we investigate whether school students‟ enjoyment of mathematics depends upon the student‟s
gender, the teacher‟s gender, or an interaction between those two factors. The students were in Y11 –
aged 15-16 years. Their level of „enjoyment„ was assessed by their responses to 12 multiple choice
questions.
4. In the Variable Lists section select Display names and Alphabetical, to match the variable list
format used in this tutorial and click OK.
► It is a good idea to widen the Univariate window so you can fully see the variable names.
12. Click OK
► Table 1 (Between-Subjects Factors) simply reports on the value labels and counts for the
two factors.
► Table 2 (Tests of Between-Subjects Effects) is the most important. It signifies where any
significant differences (effects) are found.
► The STUDENT_SEX row shows that this is not a significant factor (i.e. there is no evidence that
females and males would have different levels of enjoyment).
► The TEACHER_SEX row shows that this is not a significant factor (i.e. there is no evidence that
students with female teachers and male teachers would have different levels of enjoyment).
► The STUDENT_SEX * TEACHER_SEX row shows that this is a significant factor (i.e. there is
evidence of an interaction between the two factors, although it does not say what it is). That
can be discovered by examining a later table.
► The next three tables are optional and appear because we asked for them in steps 9 to 11.
► Table 3 (Student’s Gender) shows that the mean scores and variability of scores of Female
students and Male students were very similar. This explains why the result was not significant.
► Table 4 (Teacher’s Gender) shows that the mean scores and variability of scores of students in
Female teachers’ classes and students in Male teachers’ classes were very similar. This
explains why the result was not significant.
► Table 5 (Student’s Gender * Teacher’s Gender) is an interaction table which reveals where
the differences lie.
For female teachers, the female students recorded higher enjoyment than the male students.
For male teachers, the male students recorded higher enjoyment than the females.
The interaction effect was more pronounced among the male students:
o Males with female teachers recorded the lowest level (mean 19.9).
o Males with male teachers recorded the highest level (24.6).
► A final word of caution: this was based on a small study involving only a few teachers. A much
larger study replicating this finding would be needed to be able to claim it held true generally.
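What an interaction means in terms of cell means can be sketched in Python. Only the two male-student means (19.9 and 24.6) come from the text above; the two female-student means are hypothetical values, chosen merely to be consistent with the pattern described, so the numerical result is illustrative only.

```python
# Reading an interaction from a 2 x 2 table of cell means.
cell_means = {
    ("female teacher", "male student"): 19.9,    # from the text
    ("male teacher", "male student"): 24.6,      # from the text
    ("female teacher", "female student"): 23.0,  # HYPOTHETICAL
    ("male teacher", "female student"): 21.5,    # HYPOTHETICAL
}

def gender_gap(teacher):
    """Female-student mean minus male-student mean for one teacher gender."""
    return cell_means[(teacher, "female student")] - cell_means[(teacher, "male student")]

gap_female_teacher = gender_gap("female teacher")
gap_male_teacher = gender_gap("male teacher")

# With no interaction the two gaps would be equal, so their difference would be zero.
interaction = gap_female_teacher - gap_male_teacher

print("Gap with female teachers:", round(gap_female_teacher, 2))
print("Gap with male teachers:", round(gap_male_teacher, 2))
print("Difference of gaps (interaction):", round(interaction, 2))
```

The gaps have opposite signs, so averaging over teachers hides the effect of student gender, which is how two non-significant main effects can coexist with a significant interaction.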
Here we investigate whether there are significant differences in school students‟ attitudes to the value
of mathematics depending on two factors labeled TEST and TIME.
The students were in Y11 – aged 15-16 years. Their levels of perceived „Value-to-Self‟ and „Value-to-
Society‟ were assessed by their responses to 22 (12 Self and 10 Society) multiple choice questions
administered as part of two identical questionnaires administered at two different times, labeled „Initial‟
and „Final‟.
(1) TEST: This has two values: TEST1 = „Self‟ and TEST2 = „Society‟.
(2) TIME: This has two values: TIME1 = „Initial‟ and TIME2 = „Final‟.
1. Load data file: File Open Data DATA06_School_Maths.sav (if not open)
► Use Edit Options to check that in the General window the Variable Lists choices
are Display names and Alphabetical, to match the list format used in this tutorial.
5. Click Add.
8. Click Add.
9. Click Define.
► It is a good idea to widen the Repeated Measures window so you can fully see the variable
names.
10. Scroll down the variables list (in alphabetical order) to locate and select
VALUE_SELF_1, VALUE_SELF_2, VALUE_SOCIETY_1, VALUE_SOCIETY_2:
11. Use the blue arrow to move all four variables into the Within-Subjects Variables window:
► Table 2 (Descriptive Statistics), which is optional, displays the dependent variables' means,
standard deviations and N (number of cases).
► We can see that all means look very similar but the „Value to Society‟ standard deviations are
somewhat lower than the „Value to Self‟ standard deviations.
► Table 4 (Mauchly’s Test of Sphericity – not shown) is generated automatically but is not
relevant when the variables only take two values, as is the case here.
► Neither TEST (Sig. = 0.263) nor TIME (Sig. = 0.944) has a significant effect, but the
interaction TEST*TIME (Sig. = 0.005) does. By looking at the Descriptive Statistics
output (Table 2 – shown earlier) we deduce that the explanation is that whereas „Value-to-
Self‟ declined from Initial test to Final test the opposite was true for „Value-to-Society‟.
► Two further tables are automatically produced but they are of no interest here.
Here we test to see if the distribution of modules accessed on the VLE is normal. This is a
variable with relatively few (13) values – usually this test would be applied to a distribution
taking many values.
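The quantity behind a Kolmogorov-Smirnov normality test is the largest gap between the sample's cumulative distribution and that of a normal distribution with the same mean and standard deviation. The Python sketch below computes only that statistic, for a hypothetical sample; it does not reproduce the Sig. value SPSS reports, and the function names are invented for the example.

```python
# Kolmogorov-Smirnov statistic D against a normal distribution fitted to the sample.
# The sample is HYPOTHETICAL; no p-value is computed here.
from math import erf, sqrt
from statistics import mean, stdev

def normal_cdf(x, mu, sigma):
    """Cumulative probability of a normal distribution with mean mu and sd sigma."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

def ks_statistic(sample):
    """Largest vertical gap between the empirical and the fitted normal CDF."""
    values = sorted(sample)
    n = len(values)
    mu, sigma = mean(values), stdev(values)
    d = 0.0
    for i, x in enumerate(values, start=1):
        cdf = normal_cdf(x, mu, sigma)
        # The empirical CDF steps from (i - 1)/n to i/n at x, so check both sides.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d

sample = [3, 5, 6, 6, 7, 8, 8, 9, 10, 12]
d_value = ks_statistic(sample)
print("K-S statistic D =", round(d_value, 3))
```

A small D means the sample follows the fitted normal curve closely; the test's Sig. value expresses how surprising the observed D would be for a truly normal population.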
1. Load data file: File Open Data DATA03_LSquestionnaire.sav (if not open)
► Use Edit Options to check that in the General window the Variable Lists choices
are Display names and File, to match the list format used in this tutorial.
► If there are any variables listed in the Test Fields box select them and move them out using
the blue arrow.
4. Move the scale variable modules into the Test Fields box:
WARNING: IN SPSS 19 THIS DOES NOT SEEM TO WORK. HOWEVER, YOU CAN GET THE
RESULT(S) YOU WANT BY OMITTING STEP 6 AND INSTEAD LET SPSS DECIDE WHICH TEST
TO APPLY – i.e. SELECT „Automatically choose the test based on the data‟.
Here we test if the distributions of Average Selling Price (ASP) and Recommended Retail Price
(RRP) of the Top-selling 100 books are normal. These variables can take many values.
4. If there are any variables in the Test Fields box remove them using the blue arrow.
5. Move the scale variables ASP [Average Selling Price] and RRP [Recommended Retail Price]
into the Test Fields box.
12. Click Run. WARNING: IN SPSS 19 THIS DOES NOT SEEM TO WORK FOR THIS DATA FILE.
Note: It must be remembered that having a regression equation does not mean that variation in the
independent variable(s) causes the variation in the dependent variable. It just means there is an
association (just as for correlation, to which regression is closely related).
Here we seek a regression equation (linear) which can be used to predict the value of a dependent
variable given a value of an independent variable. This is the simplest regression model. More complex
models are sometimes used to find a curvilinear equation (not a simple straight line).
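The idea of a linear regression equation used for prediction can be sketched in Python with the ordinary least-squares formulas. The data points are hypothetical and the function names are illustrative; SPSS carries out the same fit and adds the significance tests discussed later.

```python
# Simple linear regression by ordinary least squares: y = intercept + slope * x.
# The data points are HYPOTHETICAL.
from statistics import mean

def fit_line(x_values, y_values):
    """Return (intercept, slope) of the least-squares line."""
    x_bar, y_bar = mean(x_values), mean(y_values)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_values, y_values))
    s_xx = sum((x - x_bar) ** 2 for x in x_values)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def predict(intercept, slope, x):
    """Predicted value of the dependent variable for a given x."""
    return intercept + slope * x

x_values = [0, 1, 2, 3, 4]
y_values = [1, 3, 4, 6, 6]

intercept, slope = fit_line(x_values, y_values)
print("Equation: y =", round(intercept, 3), "+", round(slope, 3), "* x")
print("Predicted y at x = 5:", round(predict(intercept, slope, 5), 3))
```

Predictions are most trustworthy inside the range of x values used to fit the line; extrapolating beyond it assumes the straight-line relationship continues.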
2. Select Edit Options and click on the General tab and in the Variable Lists section
choose Display names and Alphabetical, for the most useful listing format here.
4. Click OK.
Histogram
Normal probability plot
9. Click Continue.
► Table 2 (Model Summary) provides the Pearson correlation coefficient between the
independent variable Teacher supportiveness and the dependent variable Enjoyment
of maths (on a scale 0 to 40). In this case the correlation is 0.540 and so the Adjusted R
Square is 0.289 which shows that about 30% of the variation in Enjoyment can be
„explained‟ by Teacher supportiveness, and the relationship is positive.
► Table 3 (ANOVA) reports the ANOVA result showing the significance of the regression
model. Here the Sig. associated with the F test is 0.000 (i.e. p < 0.0005) which is highly
significant, confirming that the independent variable does explain a significant amount of
the variation in the dependent variable.
► The conclusion one might draw is that if a teacher is supportive then the student is more
likely to enjoy mathematics. However, this is only a measure of a student's perception of
teacher supportiveness, so it might be argued that if a student enjoys mathematics then
they think the teacher is supportive. Cause and effect are not easily determined!
► In Table 3 (ANOVA) the Mean Square column (produced by dividing the Sum of
Squares by the df) gives the variance. Here very much more of the variance is explained
by the Regression line (4797.093) than by the Residual (44.868) which can be considered
as unaccounted for „error‟. This reinforces the conclusion that the model is good.
which can be used to predict the Enjoyment level (0 to 40) for any given Teacher
supportiveness level (0 to 48).
► The Standardized Coefficient Beta tells us the contribution the variable makes to the
model. In this case there is just one variable and its Beta is 0.540, which equals
the Pearson correlation shown in an earlier table.
► The t value of 1.864 and associated Sig. of 0.063 (i.e. p = 0.063) for the constant is just
above p = 0.05 so one cannot rule out the possibility that the true value of the constant in
the equation is zero, although that is unlikely.
► The t value of 10.340 and associated Sig. of 0.000 (i.e. p < 0.0005) for the Teacher
supportiveness independent variable shows that the regression is statistically significant.
► Table 5 (Residuals Statistics) is shown here but will not be discussed. The two plots
which follow provide a more visual approach to examining residuals (what the model does
not „explain‟).
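The figures quoted above from the ANOVA and Coefficients tables can be cross-checked by hand. The short Python sketch below is an illustration only (Python is not part of SPSS). It computes the F ratio from the two Mean Square values and compares its square root with the t value for Teacher supportiveness, relying on the fact that in a regression with a single predictor F equals t squared.

```python
import math

# Mean Square values quoted from Table 3 (ANOVA)
ms_regression = 4797.093
ms_residual = 44.868

# F is the ratio of the explained to the unexplained variance
f_ratio = ms_regression / ms_residual

# With one predictor, the t value for that predictor is the square root of F
t_from_f = math.sqrt(f_ratio)

# R Square is the square of the correlation reported in Table 2
r = 0.540
r_square = r ** 2

print(f"F = {f_ratio:.1f}")
print(f"sqrt(F) = {t_from_f:.2f} (Table 4 reports t = 10.340)")
print(f"R Square = {r_square:.3f}")
```

The agreement between sqrt(F) and the reported t value is a useful check that the tables have been read correctly.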
► The optional Histogram plot of the Regression Standardized Residuals, with normal curve
superimposed, shows a good fit confirming that the distribution of the residuals is normal
which is a condition for the validity of the linear model.
► The optional Normal P-P Plot of the Regression Standardized Residuals shows a very good
fit between the expected cumulative probability and the observed cumulative probability,
confirming that the distribution of the residuals (i.e. the variations from the predicted line) can
be considered normal, which is a condition for the validity of the linear model.
Here we seek a regression equation (linear) which can be used to predict the value of a
dependent variable given the values of several independent variables.
1. Load data file: File Open Data DATA06_School_Maths.sav (if not already loaded).
► Use Edit Options to check that in the General window the Variable Lists choices
are Display names and Alphabetical, to match the list format in this tutorial.
3. Click Reset.
► Table 1 (Variables Entered/Removed) states the four independent variables and the entry
method chosen („Enter‟). This entry method choice means that all four independent variables
will be used in the model, even if their contribution is very small.
► Table 2 (Model Summary) provides the overall Pearson correlation coefficient between the
independent variables and the dependent variable. In this case the multiple correlation is
0.822 and so the Adjusted R Square is 0.671 which shows that about 67% of the variation in
Enjoyment can be „explained‟ by the model comprised of the four independent variables.
► Table 3 (ANOVA) reports the significance of the regression model. Here the Sig. associated
with the F test is 0.000 (i.e. p < 0.0005) which is highly significant, which confirms that the
model does explain a significant amount of the variation in the dependent variable.
► In Table 3 the Mean Square column shows that very much more of the variance is explained
by the Regression line than by the Residual (2767.213 compared to 20.873). This reinforces
the conclusion that the model is good.
► Table 4 (Coefficients) presents the coefficients for the regression equation, which is:
which can be used to predict the Enjoyment level for any given levels of the four variables.
► The above regression equation could be simplified, with little loss of accuracy, to:
► The Standardized Coefficient Beta tells us the contribution each variable makes to the
model (measured in standard deviation units of the target variable). In this case Confidence
is the most important: a change of 1 SD in Confidence would lead to a change of 0.45 SD in
Enjoyment.
► The t value of 1.864 and associated Sig. of 0.063 (i.e. p = 0.063) for the constant is just
above p = 0.05 so one cannot rule out the possibility that the true value of the constant in the
equation is zero, although that is unlikely.
► The t value of 10.340 and associated Sig. of 0.000 (i.e. p < 0.0005) for the Teacher
supportiveness independent variable shows that the regression is statistically significant.
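The Adjusted R Square values quoted in the Model Summary tables can be related to the multiple correlation R with the usual adjustment formula. The sketch below is an illustration in Python, not SPSS output. The function name is hypothetical, and the number of cases (282, the size of the data file) is an assumption, because the number of cases actually used in each regression is not stated here.

```python
def adjusted_r_square(r, n_cases, n_predictors):
    """Adjust R Square for the number of predictors in the model."""
    r_square = r ** 2
    return 1 - (1 - r_square) * (n_cases - 1) / (n_cases - n_predictors - 1)

# ASSUMPTION: n_cases = 282 is the size of the data file; the number of
# cases actually used in each regression is not stated in the text.
simple = adjusted_r_square(0.540, n_cases=282, n_predictors=1)
multiple = adjusted_r_square(0.822, n_cases=282, n_predictors=4)

print(f"One predictor:   {simple:.3f}")
print(f"Four predictors: {multiple:.3f}")
```

Under that assumption the function gives 0.289 and 0.671, matching the values quoted for the one-variable and four-variable models.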
In the previous section we found that the regression equation using the four independent
variables was approximately
This raises the question of whether it is sensible or worthwhile including the last two variables
whose contribution is quite small. By changing the variable entry method SPSS will take care
of this by only including variables which contribute significantly. This is controlled by the
Method drop-down menu, which has five options:
Forward Enters variables one at a time (in order of importance) until no significant
improvement occurs.
Backward Enters all variables at once then removes one at a time until no significant
improvement occurs.
Remove Following Enter method it removes any variables the user chooses.
Here we repeat T31.2 using the Stepwise method of entry (which gives the same result as
using the Forward method in this case – for the reader to verify).
► Use Edit Options to check that in the General window the Variable Lists choices
are Display names and Alphabetical, to match the list format in this tutorial.
► As before, several tables are produced (we only show two of them here). This time they are
bigger than in T31.2 as each table shows all the outputs for each of the successive models
which are created as each variable is added (or removed). The models are numbered 1, 2, ...
until the process is completed.
► Below we show the Model Summary table, which has three models containing 1, 2 and 3
variables (as listed in the footnotes). The R shows that the multiple correlation coefficient
increases as more variables are added.
► Below we show the Coefficients table. From this the actual models (regression equations)
can be derived.
The first model has just one variable (the most influential) – Confidence – leading to this
regression equation:
The second model has one further variable added – Usefulness to society – leading to this
regression equation:
The third model has one further variable added – Teacher supportiveness – leading to this
regression equation:
There is no fourth model as the variable Usefulness to self does not make a sufficient
contribution and so is not entered (from T31.2 we know that it would produce a multiple
correlation of 0.822).
► We do not show the other tables here. They are interpreted much as they were in T31.2.
► It may seem strange that in the four variable model derived in T31.2 the most influential
variable is Confidence but in the three variable model it is Usefulness to society. The
explanation is that when Usefulness to self is excluded its contribution is mostly taken up by
Usefulness to society. This is because Usefulness to self correlates much more highly
with Usefulness to society (+0.739) than it does with the other two independent variables
(+0.476 and +0.487).
Here we develop a regression model to predict the value of a variable which can take one of
just two values. In this example the dataset comes from a research project (2008-9)
investigating the factors which determine whether or not a student will continue studying
mathematics into Y12.
For details of the project and the derivation of the variables see the Appendix.
Briefly, there are five variables we will use – derived from a Final Questionnaire taken in May
2009 by Y11 students, before public examination results were known – to be used to predict
whether or not an individual student will continue to study mathematics in Y12 (starting Autumn
2009). (The dataset has several other variables which might be used.)
The „F‟ at the end of the variable names indicates that this is the result from the Final
questionnaire rather than the Initial questionnaire which was completed at the beginning of the
academic year (September 2008).
► Use Edit Options to check that in the General window the Variable Lists choices
are Display names and Alphabetical, to match the list format in this tutorial.
5. Click on the Categorical button and move the STUDENT_SEX variable into the Categorical
Covariates box which appears.
► This method chooses the most influential independent variable and adds it to the model, and
continues adding a variable until no significant improvement is achieved.
► In contrast, the default method adds in all the selected variables en bloc (shown in T32.2).
13. Click OK which generates a large number of output tables, as explained below:
► With all SPSS analyses there is the option to select just a subset of cases, and therefore
exclude others – e.g. select only females. (This process is explained in Part 1 of this Guide in
Section 8.2.) However, in this example there are no unselected cases (the default), so all
possible cases are included.
► This procedure produces a plethora of tables, most of which can be skimmed over by the
beginner. They are all introduced here, but by far the most important are Tables 8 and 9.
► Table 1 (Case Processing Summary) reports that all 282 cases have been selected and
that of these 200 are included in the analysis and 82 are missing (due to missing data), and
that there are no unselected cases.
► Table 3 (Categorical Variables Coding) reports the one categorical variable (Student‟s
Gender) and the frequencies of the associated values.
► Table 4 (Classification Table – for Block 0: Beginning Block) reports the first stage of the
modeling process. The model’s purpose is to predict the ‘0’ and ‘1’ correctly for as many cases
as possible. The first stage (‘Block 0’ here) just uses a constant predictor. As there are 118
‘No’ and 82 ‘Yes’ cases, taking the cut value as half (the default), SPSS calculates that ‘No’
has more cases than ‘Yes’ and so assigns ‘No’ as the constant prediction for all 200 cases.
This ensures that the percentage correct prediction is over 50%.
► Table 4 (above) reports that the overall percentage is 59% because all 118 „No‟ cases are
predicted correctly and none of the 82 „Yes‟ cases are predicted correctly. Not a very
sophisticated predictive model, of course.
► Tables 5 and 6 (Variables …) report the initial stage of the modeling process, when there is
just a constant in the model and none of the proposed variables in the model.
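The Block 0 prediction described above amounts to always choosing the more frequent outcome. The following Python sketch is for illustration only (it is not SPSS syntax) and reproduces the 59% figure from the counts quoted for Table 4.

```python
# Observed outcomes quoted for Table 4: 118 'No' and 82 'Yes'
counts = {"No": 118, "Yes": 82}

# A constant-only model predicts the more frequent category for every case
prediction = max(counts, key=counts.get)
total = sum(counts.values())
percent_correct = 100 * counts[prediction] / total

print(f"Constant prediction: {prediction}")
print(f"Overall percentage correct: {percent_correct:.1f}%")
```

Any later model is worth having only if it improves on this baseline percentage.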
► Table 7 (Omnibus Test of Model Coefficients) is somewhat complicated. The first point
to note is that the Model and the Block rows are the same (the default) so „Block‟ can be
ignored here. The second point to note is that the word „Step‟ is used in two different
ways! Steps 1, 2, … record the process of successively adding in another predictor (one
new variable each time). Within each Step, „Step‟ shows the effect on Chi-square for that
Step, and „Model‟ shows the overall Chi-square for that Step.
► Note: The larger Chi-square is, the more significant (predictive) the model is. The aim,
therefore, is to find a model with as large a Chi-square as possible, by choice of variables.
► Step 1 is the constant model and this has a highly significant effect (Sig. = 0.000 means p
< 0.0005). So the constant is assessed to be a useful (significant) predictor.
► Step 2 is the constant plus one variable (note it does not report which variable at this
point). The Step 2 Step Chi-square entry of 6.817 is to be added to the Step 1 Model Chi-
square value (91.255) to give the corresponding Step 2 Model Chi-square value which is
98.072. This Step too has a highly significant effect (Sig. = 0.009). So the constant plus
one variable is assessed to be a useful (significant) predictor.
► Step 3 is the constant plus two variables. The Step Chi-square entry of 4.252 is to be
added to the previous value (98.072) to give the corresponding Model Chi-square value
which is 102.324. This Step too has a significant effect (Sig. = 0.039). So the constant
plus two variables is assessed to be a useful (significant) predictor.
► Step 4 is NOT there because adding any further variable does not significantly improve the
prediction (i.e. Chi-square does not increase much). So the modeling process is
terminated.
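The way the Step and Model Chi-square values in Table 7 relate to each other can be shown as a running total. This Python sketch is an illustration only and uses just the three Step values quoted above.

```python
# Step Chi-square values quoted from Table 7, in order of entry
step_chi_squares = [91.255, 6.817, 4.252]

model_chi_square = 0.0
model_values = []
for step_value in step_chi_squares:
    # Each Model Chi-square is the previous Model value plus the new Step value
    model_chi_square += step_value
    model_values.append(round(model_chi_square, 3))

for step_number, value in enumerate(model_values, start=1):
    print(f"Step {step_number}: Model Chi-square = {value}")
```

The second and third totals match the Model values of 98.072 and 102.324 reported in the table.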
► Table 8 (Model Summary) shows a measure (- 2 Log likelihood) of how well the model
fits the data – a perfect fit would be zero (this column can be ignored here).
► Table 8 also provides two different estimates of the R² value which indicates what
percentage of the dependent variable can be ‘explained’ by the model. Note that this
increases with each step, for both ‘conservative’ and ‘optimistic’ estimators. It appears
here that the model explains about half the variability.
► Table 9 (Classification Table – for Block 1) is important because it shows the predictive
ability of the model at each stage.
► Step three (the final model in this case) reports that the model correctly predicts 80.0% of
cases. Of the 118 who actually do not go on to study maths in Y12, it correctly predicts 99 will
not (which is 83.9%) and of the 82 who do, it correctly predicts that 61 will (which is
74.4%), giving overall 80.0% accuracy.
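The percentages in the final step of Table 9 follow directly from the four counts quoted above. A minimal Python sketch of the arithmetic, for illustration only:

```python
# Counts quoted for the final step of Table 9
actual_no, correct_no = 118, 99
actual_yes, correct_yes = 82, 61

# Percentage correct within each observed group
pct_no = 100 * correct_no / actual_no
pct_yes = 100 * correct_yes / actual_yes

# Overall percentage correct across all cases
pct_overall = 100 * (correct_no + correct_yes) / (actual_no + actual_yes)

print(f"'No' predicted correctly:  {pct_no:.1f}%")
print(f"'Yes' predicted correctly: {pct_yes:.1f}%")
print(f"Overall accuracy:          {pct_overall:.1f}%")
```

Note that the overall figure is weighted by group size, so it is not the simple average of the two group percentages.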
► Three further tables appear – Tables 10 to 12 – but only Table 10 (Variables in the
Equation) is important and reproduced here. It shows which variables are included in the
model at each stage. It also gives indicators of the variables‟ importance (contributions)
although it is not always as easy to interpret as for simple linear regression.
► The B column indicates the weight used in the model for the given variable. The
bigger the weight the more important the contribution. However, the weight depends upon
the scale used for the variable – the bigger the range the smaller will be the weight.
► In this example, except for STUDENT_SEX, the scales are 0 to 48 and 0 to 40 so they are
more-or-less comparable and B itself is quite a good indicator for comparing the
contributions of the variables. Here the Step 3 B column entries show that Enjoyment
(0.135) is by far the most important variable, with Confidence (0.071) and Teacher
Support (0.070) much less important and about equal.
► The Wald column is a measure of the significance of the B value for the variable: higher
values indicate greater contribution, and take into account the df (degrees of freedom).
Here df is „1‟ for all variables so one can directly look at the Wald numbers. It confirms
that Enjoyment is by far the most important variable (Sig. = 0.000), with Confidence and
Teacher Support much less so and about equal. (For these last two the p values are
almost 0.05 – the default entry criterion, so they only just qualify for inclusion.)
► For the sake of completeness, and not expecting full understanding from all readers, the
regression equation for this model is given below. Unlike the simple linear regression model,
this produces a probability. If the probability is above the cut value (usually chosen as 0.5) the
outcome is considered „1‟ i.e. „Yes‟ and otherwise is considered „0‟ i.e. „No‟. The equation
involves logarithms which can be re-expressed in terms of exponentials like this:
Prob (‘1’) =
1 / {1 + exp(-B0) x exp(-B1 x Variable 1) x exp(-B2 x Variable 2) x exp(-B3 x Variable 3) x …}
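The equation above can be tried out numerically. The Python sketch below is an illustration only. The function names are hypothetical, and the constant, weights and scores are made-up values, not the B values from Table 10. It also checks that the product-of-exponentials form gives the same probability as the more common single-exponential form.

```python
import math

def probability_of_yes(constant, weights, values):
    """Logistic model: 1 / (1 + exp(-(B0 + B1*x1 + B2*x2 + ...)))."""
    linear_part = constant + sum(b * x for b, x in zip(weights, values))
    return 1 / (1 + math.exp(-linear_part))

def probability_of_yes_product_form(constant, weights, values):
    """Same model written as a product of exponentials, as in the text."""
    product = math.exp(-constant)
    for b, x in zip(weights, values):
        product *= math.exp(-b * x)
    return 1 / (1 + product)

# HYPOTHETICAL constant, weights and scores, for illustration only
b0 = -5.0
weights = [0.1, 0.05]
scores = [30, 20]

p = probability_of_yes(b0, weights, scores)
p_product = probability_of_yes_product_form(b0, weights, scores)

# Apply the default cut value of 0.5
outcome = "Yes" if p > 0.5 else "No"
print(f"Probability of 'Yes' = {p:.3f}, predicted outcome: {outcome}")
```

Changing the cut value changes only the last step; the probability itself is unaffected.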
This follows directly on from T32.1 but uses the „Enter‟ entry method which forces all the
selected variables to be used all at once in the model. This is clearly quicker, but does not
allow SPSS to decide which variables are not worth including.
If continuing directly from T32.1, you need only repeat step 2 and proceed directly to step 12:
5. Click on the Categorical button and move the STUDENT_SEX variable into the Categorical
Covariates box which appears.
6. Click Continue.
13. Click OK which generates fewer output tables than before; only the most important two are
shown below.
► The Classification Table shows that including all six variables gives an accuracy rate of 82.5%
(compared to 80.0% with three variables).
► The Variables in the Equation Table shows that the only significant contributors (from the
Wald and Sig. columns) are:
► The reason this model has a different third variable is that there are correlations between all the
pairs of variables and including the extra three diminishes the influence of CONFIDENCE_F.
► Although STUDENT_SEX has a relatively large B value, it is in fact of negligible importance (Sig.
= 0.654). Its weight B is large because it only takes small values 1 and 2 so its range is 1 and its
mean about 1.5. The other variables have actual ranges close to 40 and 48 and means
around 20 to 30 (see the Descriptive Statistics table below to confirm this).
Reliability is the ability of a test to be consistent in its outcome. This differs from Validity which is the ability
of a test to measure accurately what it is designed to measure. Both are important in a questionnaire or
test. One aspect of achieving a reliable test instrument is to have a series of similar questions about the
topic, attitude or opinion under investigation – some worded positively, others negatively – and to assess
how consistent the answers are. There are various ways to achieve this – here we introduce Cronbach‟s
Alpha method which is the most popular method. Reliability is measured on a scale of 0 to 1 with 1
indicating perfect reliability and 0 no reliability. A value above 0.75 is generally considered good, and a
value above 0.9 is something to aspire to.
Cronbach’s Alpha is derived from the mean of the correlations between all pairs of items (r) and the
number of items (n). For those interested, the formula is: alpha = n x r / (1 + (n - 1) x r). This means that,
provided r is positive, as n gets larger alpha will get closer to 1 (however small r is), and, as r gets
closer to +1, alpha will get closer to 1 (however small n is).
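The behaviour of this formula is easy to explore numerically. The Python sketch below is an illustration only. The function name is hypothetical and the mean correlation of 0.2 is a made-up value chosen to show the effect of adding items.

```python
def cronbach_alpha_from_mean_r(n_items, mean_r):
    """Alpha from the number of items and the mean inter-item correlation."""
    return n_items * mean_r / (1 + (n_items - 1) * mean_r)

# HYPOTHETICAL mean inter-item correlation, for illustration only
mean_r = 0.2

for n_items in (2, 6, 12, 50):
    alpha = cronbach_alpha_from_mean_r(n_items, mean_r)
    print(f"n = {n_items:2d}, r = {mean_r}: alpha = {alpha:.3f}")
```

Even with a modest mean correlation, alpha rises steadily as items are added, which is why a long scale can show high reliability without any pair of items being strongly correlated.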
Cronbach‟s Alpha method can be used in two ways: to develop a valid coherent set of questions – a
„scale‟ (by weeding out „poor‟ questions) and to provide evidence that a questionnaire used in research is
indeed reliable, so conclusions derived have a solid foundation.
Factor Analysis (the subject of T34) can be useful in determining which items go together to form a
coherent scale (to produce a scale that is uni-dimensional, or to identify sub-scales within it). The set of
identified items forming the scale, or sets forming subscales, can then be tested for reliability.
Here we look at a set of 12 questions about the perceived usefulness of mathematics to school students,
which was part of the research project introduced in T31, about which further details can be found in the
Appendix.
A typical „scale‟ will consist of a set of multiple choice questions on a 5-point scale (or 7-point scale).
Before the responses to the questionnaire can be analysed for reliability, the results for all negatively
worded questions must be „turned round‟, so that for every question on a 5-point scale a „5‟ means a very
positive attitude to the attribute being assessed, and „1‟ means a very negative attitude.
For a multiple choice question on a 5-point scale this is easily achieved by computing a new variable for a
negatively worded question whose values are given by „new code = 6 – original code‟. In this example this
has already been done and the 12 „positive‟ variables produced are shown above, which also shows the
actual questions used.
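The ‘turning round’ rule described above (new code = 6 – original code on a 5-point scale) generalises to any scale length. The Python sketch below is an illustration only, with a hypothetical function name. In SPSS itself the same result would be produced by computing a new variable.

```python
def reverse_code(response, scale_points=5):
    """Reverse a response on a 1..scale_points scale (e.g. 6 - code for 5 points)."""
    if not 1 <= response <= scale_points:
        raise ValueError("response is outside the scale")
    return (scale_points + 1) - response

original_codes = [1, 2, 3, 4, 5]
reversed_codes = [reverse_code(code) for code in original_codes]
print(reversed_codes)
```

For a 7-point scale the same function would be called with scale_points=7, giving new code = 8 – original code.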
► Use Edit Options to check that in the General window the Variable Lists choices
are Display names and Alphabetical, to match the list format in this tutorial.
3. Scroll down through the long list of variables to locate and select Uself01F to Uself12F (this is
most easily done by enlarging the window so all the 12 variables are visible, then click the first
variable and shift-click the last variable) and move them all together into the Items box, as
illustrated:
5. Click Continue.
6. Click OK.
► Table 1 (Case Processing Summary) reports that of the 282 cases available for the
analysis, 40 were excluded (due to missing data).
► Table 2 (Reliability Statistics) reports that for the scale of 12 items Cronbach‟s Alpha is
0.919. This is a very high value, indicating excellent reliability.
► Table 3 (Item-Total Statistics) lists the 12 items in the scale (and shows their labels which
include the questions). The last column is very important as it indicates whether the scale
could be improved by excluding any questions.
► In this case the overall Alpha value is 0.919 but if the last item were removed the Alpha would
rise a little to 0.921.
The next step, then, is to repeat the analysis excluding that item, as follows:
9. Click OK.
► This time the Cronbach‟s Alpha value is 0.921, with N = 11, as predicted in the Item-Total
Statistics table above.
► What is a little surprising is that the new Item-Total Statistics table now says that the
Cronbach’s Alpha value can be further improved by removing item 10 (see below).
► The Cronbach‟s Alpha is now 0.922 using a 10-item scale, and cannot be further improved.
Here we attempt to create a scale to measure the commitment of a student to mathematics, using
responses to seven questions.
1. Load data file: File Open Data DATA06_School_Maths.sav (if not loaded)
► Use Edit Options to check that in the General window the Variable Lists choices
are Display names and Alphabetical, to match the list format in this tutorial.
3. Click Reset.
4. Widen the Reliability Analysis window (to be able to see the longest variable names) and
locate and move the following seven variables into the Items box:
CAREER_MATHS_EXTENT
HE
HE_MATHS_EXTENT
MATHS_IMPORTANCE
MATHS_POSITION
MATHS_Y12
Y12_MATHS_PREDICTION
6. Click Continue.
► Studying the Corrected Item-Total Correlation column in the table above shows that the
second item (HE) has a very low correlation (0.157) with the other items combined, but its
exclusion at this point would not increase the Alpha value (although it should be removed
anyway).
► The last item in that column (showing the label for variable Y12_MATHS_PREDICTION) is
negatively correlated with the remaining combined variables. This is not acceptable in a
scale. It could be deleted to improve the scale (Alpha would then rise to 0.662). However, as
its correlation is actually quite high (–0.634), it is better to keep it but make the correlation
positive by reversing the direction of the coding. This we will now do, using the procedure
explained in Reference Section 9.1.
► We will change this so that ‘Yes’ is coded ‘1’ and ‘No’ is coded ‘2’. This can be done by:
either Transform Recode into Different Variables using ‘0’ → ‘2’ and ‘1’ → ‘1’
12. Select Variable View and scroll to the bottom of the list of variables to find Y12_MP_NEW.
15. Click on the Values cell to open the Value Labels window.
16. In the Value box enter „1‟ and in the Label box enter „Yes‟ and click Add.
17. In the Value box enter „2‟ and in the Label box enter „No‟ and click Add.
21. Select Y12_MATHS_PREDICTION in the Items box and remove it using the blue arrow.
22. Locate and select Y12_MP_NEW and move it into the Items box.
► The last column of the Item-Total Statistics table shows that removing the second item
(which has an unacceptably low correlation anyway) would raise Alpha to 0.713, and
removing the fifth item would raise Alpha to 0.793. Removing both should increase Alpha
further. These removals can be done one at a time, or both together.
25. Select MATHS_POSITION in the Items box and remove it using the blue arrow.
26. Select HE in the Items box and remove it using the blue arrow.
► Now Cronbach‟s Alpha has risen further, to 0.828, which represents high reliability.
► The last column of the Item-Total Statistics table shows that the Item-Total correlations
are all reasonably good, so no more variables need to be removed, and removing any
would reduce the reliability.
► The final scale has been determined, containing five items, with reliability 0.828.
We investigate how many separate components or factors are evident in a set of ten questions
purporting to measure „enjoyment‟.
1. Load data file: File Open Data DATA06_School_Maths.sav (if not already loaded).
► Use Edit Options to check that in the General window the Variable Lists choices
are Display names and Alphabetical, to match the list format in this tutorial.
4. Open the Descriptives window and select KMO and Bartlett’s test of sphericity.
5. Click Continue.
► The required Display option will already be selected – Unrotated factor solution.
► The required Extract option will already be selected – Eigenvalues greater than: 1.
7. Click Continue.
9. Select Varimax.
► The required Missing Values option will already be selected – Exclude cases listwise.
► Table 1 (KMO and Bartlett’s Test) reports two important tests of whether the data are
suitable for Factor Analysis.
► KMO provides a measure of whether the distribution of values in the variables is suitable.
The scale is 0 to 1 with 0.5 the minimum acceptable. The following descriptions for the
values obtained have been provided:
0.9+ = marvelous, 0.8+ = meritorious, 0.7+ = middling, 0.6+ = mediocre, 0.5+ = miserable.
► The value for our set of variables is 0.897 which is bordering on „marvelous‟.
► For Bartlett’s test, the Sig. value for our set of variables is 0.000 (so p < 0.0005) which is excellent.
► Given these two positive results, it is valid to continue with the analysis.
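The KMO descriptions listed above can be written as a simple lookup. The Python sketch below is an illustration only. The function name and the wording returned for values under 0.5 are assumptions and do not come from SPSS.

```python
def kmo_description(kmo):
    """Return the descriptive band for a KMO value, using the bands listed above."""
    bands = [
        (0.9, "marvelous"),
        (0.8, "meritorious"),
        (0.7, "middling"),
        (0.6, "mediocre"),
        (0.5, "miserable"),
    ]
    for lower_bound, label in bands:
        if kmo >= lower_bound:
            return label
    # Wording for this case is an assumption, not taken from the guide
    return "below the minimum acceptable value"

print(kmo_description(0.897))
```

A value of 0.897 falls in the ‘meritorious’ band, just short of the 0.9 threshold, which is why the text describes it as bordering on ‘marvelous’.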
► Table 2 (Communalities) shows that to start the iteration process the 10 variables are all given an initial communality
of 1. Communality is the amount of variance in the variable explained by all the factors (yet
to be found). The value will reduce as the analysis continues, and must lie between 0 and 1
(similar to a multiple correlation).
► It also shows the Extraction values for each variable – that is the amount of variance of the
variable attributable to the set of components which have been found („extracted‟) by the
Factor Analysis process – in this case there are two as shown below in the Total Variance
Explained table. High values are therefore good.
► Table 3 (Total Variance Explained) lists the 10 initial eigenvalues for the initially assumed
10 components (or factors) – the same number as there are variables. An eigenvalue of at
least 1 is the normal criterion for the existence of a component.
► From the Initial Eigenvalues Total column we see that there are just two components
identified. From the Cumulative % column we see that these account for 71% of the total
variance.
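The ‘eigenvalue greater than 1’ rule and the Cumulative % figure can be illustrated with a few lines of Python. This is a sketch only. The eigenvalues below are hypothetical values chosen so that they sum to 10 (the number of variables); they are not the values from Table 3.

```python
# HYPOTHETICAL eigenvalues for 10 variables (they sum to 10); these are
# NOT the values from Table 3, which is not reproduced in the text.
eigenvalues = [5.6, 1.5, 0.7, 0.5, 0.45, 0.4, 0.3, 0.25, 0.2, 0.1]

# Keep only components whose eigenvalue exceeds 1 (the default Extract option)
retained = [ev for ev in eigenvalues if ev > 1]

# With standardised variables the total variance equals the number of variables
percent_explained = 100 * sum(retained) / len(eigenvalues)

print(f"Components retained: {len(retained)}")
print(f"Variance explained by retained components: {percent_explained:.0f}%")
```

Each variable contributes one unit of variance, so an eigenvalue below 1 means the component explains less than a single original variable would.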
► Table 4 (Component Matrix) lists the 10 variables and shows for each how much of its
variance is attributable to each (of the two) components. This is of only limited interest. The
rotated component matrix, which follows, is what really matters.
► Table 5 (Rotated Component Matrix) shows the result of a mathematical axis rotation
designed to maximize the effect of one component on a variable and minimize the effect of
all other components. The details are not so important. The result is what matters.
► What the table shows is that the first 7 variables listed (sorted by size of eigenvalue as
requested in step 12) correspond to one component and the last three correspond to the
other component. Note that in the Component 1 column the variances start at 0.840 and go
down to -0.659 and then suddenly jump lower at which point Component 2‟s variances
become large („take over‟).
► What we have done so far is – perhaps – the easy bit. What remains is the interpretation.
This requires a careful look at which variables constitute each component. This requires
knowledge of the research field and the source of the variables. Even though this is not your
own research, here it is not too difficult. The source is a set of 10 questions about ‘enjoyment’
included in a much bigger questionnaire given to Y11 students.
► If you read the 7 questions in Component 1 you will observe that they are all about „interest‟
in the subject (one is negatively worded).
► If you read the 3 questions in Component 2 you will observe that they are all about „feeling
uncomfortable‟ about the subject.
► It does seem that the Factor Analysis has picked out two distinct components in what the
researchers originally thought was a single „construct‟ (to use a psychological term).
► This was a very small set – sets of hundreds of variables (questions) are quite normal and, in
such cases, obviously the computer is a vital tool in finding components. But the
interpretation has to be done by a person!
1. Load data file: File Open Data DATA06_School_Maths.sav (if not already loaded).
4. Enlarge the Factor Analysis window vertically as much as possible to reveal as many variables
in the list as you can.
5. We need to select S01 to S36 and move them all into the Variables box. The variables will
probably be listed in „Display Variable Names‟ order which is not very helpful here as the ones
we want will appear mixed in with others we don‟t want. It is better if they are listed in „Display
Variable Labels‟ order, then all 36 variables S01 to S36 will be listed next to each other.
Do this by right-clicking on the variable list and selecting „Display Variable Labels‟, then click on
the first (S01) and SHIFT-click on the last (S36), then move them into the Variables box. (Do it
in two halves if necessary.)
6. Open the Descriptives window and select KMO and Bartlett’s test of sphericity.
7. Click Continue.
9. Select Varimax.
► In Table 1 KMO provides a measure of whether the distributions of values in the variables is
suitable.
► The value for our set of 36 variables is 0.919 which is in the „marvellous‟ category.
► For Bartlett’s test, the Sig. value for our set of variables is 0.000 (so p < 0.0005) which is excellent.
► Given these two positive results, it is valid to continue with the analysis.
► Table 3 (Total Variance Explained – partly shown) lists the initial 36 eigenvalues. An eigenvalue
of at least 1 is the normal criterion for the existence of a component.
► From the Initial Eigenvalues Total column we see that there are 7 components identified. From
the Cumulative % column we see that these account for 62% of the total variance.
► Table 4 (Component Matrix – not shown) lists the 36 variables and shows for each how much of
its variance is attributable to each of the seven components.
► The Factor Analysis reported in Table 5 clearly identifies three components and, less convincingly,
four more. We now look at the first three in detail.
► Component 1:
► It can be seen that 11 of the 12 variables were designated as „Confidence‟ questions. The twelfth
(S36), although designated as a „Teacher support‟ question can be viewed as relating to
„Confidence‟. This is a very convincing component (or factor). It is a near-perfect match with the
researcher‟s intended factor.
► Component 2:
► It can be seen that all 12 of these variables were designated as „Useful to self‟ questions. This is a
very convincing component (or factor). It is a perfect match with the researcher‟s intended factor.
► Component 3:
► It can be seen that all 6 of these variables were designated as ‘Teacher support’ questions.
Although there are 6 other ‘Teacher support’ questions which have ‘got lost elsewhere’, this is a
convincing component (or factor). It is in one sense a perfect match with the researcher’s
intended factor.
► It is not worth spending time trying to make sense of the remaining four components – especially
as they have so few questions. In a much larger study the researcher would wish to investigate
further.
► As a final point, there are links between Reliability Analysis (TUTORIAL T33) and Factor Analysis
(TUTORIAL T34). Having developed and trialled a large set of questions, one can test the
reliability of subsets (potential components) and use Factor Analysis to confirm the validity of
these and also discover other subsets for further investigation and trialling, leading to the
discovery of new ‘constructs’. Thus the two methods can work well together.
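As an illustration of the eigenvalue criterion used in this tutorial (components with an eigenvalue of at least 1 are retained, and their share of the total variance is read from the Cumulative % column), the following minimal Python sketch may help. It is not SPSS output: the eigenvalues are hypothetical and the function name `retained_components` is an assumption made for this example.

```python
def retained_components(eigenvalues, threshold=1.0):
    """Count components meeting the eigenvalue criterion and the
    percentage of total variance they account for.

    For a correlation matrix the eigenvalues sum to the number of
    variables, so the total is simply their sum.
    """
    total = sum(eigenvalues)
    kept = [ev for ev in eigenvalues if ev >= threshold]
    cumulative_percent = 100.0 * sum(kept) / total
    return len(kept), cumulative_percent


# Hypothetical eigenvalues for six variables (NOT the values in Table 3).
example_eigenvalues = [2.7, 1.5, 0.8, 0.5, 0.3, 0.2]
count, percent = retained_components(example_eigenvalues)
print(f"{count} components retained, explaining {percent:.1f}% of the variance")
```

The same reading applies to the full table of 36 eigenvalues: count the rows whose Total is at least 1, then read the Cumulative % on the last of those rows.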
This data file contains details of the top 100 best selling books in the period 1989 to 2010.
Cases: 100
Variables: 15: position in the top 100, title, author(s), publisher (imprint), publisher group, number of
books sold (volume), sales value (£), recommended retail price (RRP), average selling price
(ASP), type of binding (paperback or hardback), month of publication, year of publication,
product class code (detailed coding used by publishers to categorise books), genre (crime,
fiction, etc), type (adult fiction, children’s fiction, non-fiction).
QUESTIONS
1. What are the mean, median and standard deviation of the average selling price (ASP) of books in the
top 100 best sellers list?
2. What are the mean, median and standard deviation of the number of books sold?
4. What is the number of hardback books and paperback books in the 100 best sellers list?
7. How many authors have more than two books in the list?
8. How many of the best-selling books were published in the five years 2006 to 2010? How does this
compare to the previous five years 2001 to 2005?
9. Which month appears to be the best month for launching (a) a paperback, (b) a hardback? Would you
offer advice to publishers on this data? If not, why not?
10. Is there a significant difference between the recommended retail price (RRP) of the books compared to
the average selling price (ASP)?
11. Is there a significant difference between the average selling price (ASP) of fiction and non-fiction books?
These questions relate to responses from 150 Information Science students to a questionnaire concerning
a university’s VLE. The questionnaire itself is presented on the next page.
Cases: 150
Variables: 24
QUESTIONS
1. What are the differences between undergraduate and postgraduate use of the VLE? Are they
important?
2. Are there any differences in use of the VLE between different programmes?
3. Are there any differences in use of the VLE between full-time and part-time students?
4. Are there any differences in use of the VLE between male and female students?
6. How many modules do students typically access on the VLE? Are there differences for different groups
of students?
7. What information do students access on the VLE? Are there differences for different groups of
students?
8. What do students think about the VLE? Are there differences for different groups of students?
9. Which individual module correlates most highly with the final programme mark?
VLE Questionnaire
This Department is evaluating the use of the University’s VLE by undergraduates and postgraduates to
determine whether any changes are required. As a student of this department we would greatly
appreciate your completing the questionnaire. Your responses will remain confidential. Thank you.
5. Have you used the VLE this academic year to access module
information?
Yes No If no please return questionnaire
6. For how many modules (0 to 12) have you accessed the VLE?
No. =
8. During the last two months what information have you accessed on
the VLE? Tick all that apply.
Module specifications
Coursework outlines
Module timetables
Reading lists
Exam papers
Learning materials
INTRODUCTION
SQWconsulting and LISU (Loughborough University) were commissioned by Research Councils UK
(RCUK) to identify the effects and assess the impact of Open Access to research outputs on pay-to-
publish and self-archiving publishing models.
Open Access models provide free online access to research literature either by publishing in an
Open Access journal which does not charge (‘Gold’ OA) or by archiving peer-reviewed articles
published in subscription journals (‘Green’ OA).
INSTITUTIONAL DATA
A data file supplied with this Guide (DATA04_OpenAccess_HEIs.sav) includes responses from all 39
replying institutions (of 168 contacted); just under half of the questions in the survey are included.
The questions for which response data are provided in DATA04_OpenAccess_HEIs.sav are included
on the following pages.
RESEARCHER DATA
A data file supplied with this Guide (DATA05_OpenAccess_Researchers.sav) includes responses
from 418 replying individuals (of 2122 total responses); about half of the questions in the survey are
included. The institutional variables are included in the Researchers’ dataset.
The questions for which response data are provided in DATA05_OpenAccess_Researchers.sav are
included on the following pages. See the end of this Guide for links to the Open Access Report.
QUESTION SETS IN THIS GUIDE FOR USE WITH THE SUPPLIED DATASETS
Set 3: General questions covering both datasets, not specifically linked to SPSS.
QUESTIONNAIRES
The actual questionnaires used by SQWconsulting/LISU are presented, in abbreviated form, after the
three sets of questions which now follow.
Variables: 27 (Institutions dataset); 77 (Researchers dataset)
QUESTIONS
1. According to researchers, what percentage of institutions have their own repository? How does this
differ from what the institutions say? Why do you think there is a difference?
3. Where have researchers published their research in the previous five years? Is this influenced by the
category of researcher?
4. What dates were given for the researchers’ most recent open access publication? Was one category
of researcher more prolific than the others?
5. What were the researchers’ main reasons for publishing in an open access journal or repository?
7. Who did the researchers think should bear the cost of publication of research outputs? How does
this differ from how the institutions say open access is funded at their institutions?
8. When do researchers anticipate open access becoming the normal route for publication of research
outputs in their discipline?
9. According to institutions, does their library include open access publications in their catalogues?
10. How is material deposited in the university repositories where they exist?
11. Have measures been put in place to encourage authors to deposit material where it is not
mandated/required?
12. What was the mean number of items deposited or downloaded in the 2006-07 academic year? Did
this vary between different types of institutions?
13. What were the mean number of total items held and the number of individual depositors? Did this vary
between different types of institutions?
QUESTIONS
1. (a) Use Frequencies to check the accuracy of the ‘No of responses’ column in Table B-1 below.
(b) Use Crosstabs to check the accuracy of the ‘FTE students’ column.
2. Use Crosstabs to check the accuracy of Table B-3 below [Source: Annexes page 13].
3. Use Frequencies to check the veracity of the statement below [Source: Annexes page 18].
4. Use an appropriate procedure to check the accuracy of the data in Figure B-2 below [Source: Annexes
page 21]
5. The chart below is essentially the same as that above for ‘Mediated by repository staff’. Use Chart
Builder to produce a similar chart for ‘By authors directly’. The chart above (Figure B-2) can be used
to check that you have the correct data.
6. Use an appropriate procedure to check the accuracy of Table B-13 below [Source: Annexes page 23].
You will find some discrepancies. Look at the ‘No. included’ column, which may help, at least partly, to
explain how these discrepancies may have arisen.
QUESTIONS
1. Use Frequencies to check whether the statement below [Source: Annexes page 18] is in accord with
the responses provided by Researchers in the sample data set.
2. Table B-15 below shows on the right the number of researchers in each category based on the valid
responses made by 2116 of the 2122 cases.
[Source: LISU/SQWconsulting Open Access to Research Outputs: Annexes, page 24].
The smaller data set supplied for use with this Guide has 418 cases, which is a 20% sample.
Use the One-sample Chi-square Test to check whether the sample of 418 cases is representative of
the whole survey dataset of 2116 valid cases for ‘Researcher category’ (i.e. check that it has
appropriate numbers in each category of researcher to accurately preserve proportionality).
Notes:
(b) You will need to analyse Q02, entering into the Expected Values box each number for the six
Researcher categories in the Count column above. [See TUTORIAL T26 for an example.]
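The logic of this representativeness check can also be sketched outside SPSS. The minimal Python example below scales the full-survey counts down to the sample size to obtain expected counts, then computes the chi-square statistic. The counts are hypothetical (they are not the figures in Table B-15), and the function name `chi_square_gof` is an assumption for illustration. The critical value 11.070 is the standard upper 5% point of the chi-square distribution with 5 degrees of freedom (six categories minus one).

```python
def chi_square_gof(observed, population_counts):
    """One-sample chi-square statistic for a sample against the
    proportions implied by the population counts."""
    n = sum(observed)
    total = sum(population_counts)
    # Expected counts keep the population proportions but sum to n.
    expected = [n * c / total for c in population_counts]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, expected


CRITICAL_5PCT_DF5 = 11.070  # chi-square, 5 degrees of freedom, 5% level

# Hypothetical counts for six researcher categories (illustration only).
population = [600, 500, 400, 300, 200, 116]   # sums to 2116
sample = [120, 100, 80, 58, 40, 20]           # sums to 418

statistic, expected = chi_square_gof(sample, population)
representative = statistic < CRITICAL_5PCT_DF5
print(f"chi-square = {statistic:.3f}; consistent with population: {representative}")
```

A statistic below the critical value gives no evidence that the sample proportions differ from those of the full survey.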
3. Use Chart Builder to replicate the lower chart in Figure B-3 below, for Research staff.
Note: The detailed procedures for obtaining the upper chart are given in TUTORIAL T15.2.
4. Figure B-6 below [Source: Annexes page 29] reports on whether the researcher said his institution does
or does not have its own repository. Carry out an analysis to see how many of those who claimed to
know, were correct.
Notes:
This will entail using the Researchers’ data file (DATA05_OpenAccess_Researchers.sav), which
includes all the variables from the Institutions data file, which are named Inst.Q1, Inst.Q2a, etc.
5. (a) Use Crosstabs and the associated Chi-square Test to check the accuracy of the statement below
[Source: Annexes page 38], by comparing with results from the analysis of the sample data set.
6. Use the One-sample Chi-square Test to check whether the sample of 418 cases is representative of
the whole survey dataset of 2122 cases for the variable ‘Year of most recent Open Access publication’.
Use the data in Figure B-9 below [Source: Annexes page 36] to provide the ‘Expected Values’.
Notes:
(b) You will need to recode the variable to have just three values for ‘Year’.
Cases: 282
Variables: 180
This data file contains numerous statistics for 282 Y11 school students (aged 15-16 years) participating in
a National Centre for Excellence in Teaching Mathematics (NCETM) funded research project
investigating, among other things, factors which determine whether a student will continue studying
mathematics into Y12. Each Y11 student participant was given an ‘Initial Questionnaire’ at the start of the
project, in September 2008, and essentially the same questionnaire was administered again near the end
of the project – the ‘Final Questionnaire’ – in May 2009, before public examinations were sat.
A slightly reduced version of the Initial Questionnaire is provided on the following pages. The reader is
encouraged to look at this questionnaire, as it will make it much easier to understand the description of
variables which now follows.
Names and labels for variables in the dataset which derived from the Initial Questionnaire either have no
special identifier or are indicated by ‘[Initial]’.
Names and labels for variables in the dataset derived from the Final Questionnaire have the special
identifier ‘F’ or ‘[Final]’.
Apart from seeking demographic information, there were individual questions about favourite subjects,
future aspirations and study intentions, and then large sets of related questions aimed at investigating the
students’ views on five aspects of mathematics:
All these questions were in multiple choice format, using a five-point scale with ‘5’ indicating ‘Strongly
Agree’ and ‘1’ indicating ‘Strongly Disagree’.
The first three sets of questions were mixed in together (S01 to S36).
The last two sets stood alone (E01 to E10 and Usoc01 to Usoc10).
The Final Questionnaire’s equivalent variables are S01F to S36F, E01F to E10F, Usoc01F to Usoc10F.
Some questions were positively worded and some negatively. For creating composite indices and in
regression analysis the negative responses had to be ‘turned round’ and made ‘positive’.
The Usefulness-to-self questions have been copied from the S01–S36 set and transformed to all be
positive and stored as Uself01–Uself12. Those from the Final Questionnaire are: Uself01F–Uself12F.
The Enjoyment questions have been copied and transformed to all be positive and stored as E01plus–
E10plus.
A small number of other variables have been derived from the above. All variables have explanatory
labels.
Note: Most of these questions require the more advanced statistical tests and SPSS procedures.
T TEST
1. Use the Paired-Samples t Test to see if each of these three pairs of variables can be considered to have
come from populations with the same mean:
2. Use the Independent-Samples t Test to see if the scores for female students (code = 1) and male
students (code = 2) can be considered to have the same mean:
(a) CONFIDENCE
(b) ENJOYMENT
(c) USEFUL_SOCIETY
REGRESSION
3. Create a simple regression equation to predict USEFUL_SELF from USEFUL_SOCIETY. Report on its
validity and accuracy.
4. Create a multiple regression equation using the ‘Enter’ entry method to predict USEFUL_SELF from
MATHS_IMPORTANCE, MATHS_POSITION, MATHS_Y12. Report on its validity and accuracy.
5. Repeat 4 using the ‘Forward’ entry method. Report on its validity and accuracy, and compare to 4.
6. Use logistic regression with the ‘Enter’ entry method to predict the binary variable GCSE_AstarA from
CONFIDENCE_F, ENJOYMENT_F, STUDENT_SEX, TEACHER_SEX, TEACHER_SUPPORT_F.
Report its validity and accuracy.
7. Repeat 6 using the ‘Forward: LR’ entry method. Report its validity and accuracy, and compare to that
obtained in 6.
8. There is a flaw in applying the analysis in 6 and 7 to this particular dataset because quite a lot of the
students included in the project had already taken and passed their GCSE mathematics the year before
(usually with high grades), and were studying for AS mathematics or other mathematical qualifications. It
would be more sensible to exclude them from the analysis.
This is done using Data → Select Cases as described in Section 8.2. The variable to use is
GCSE_YEAR, which should be ‘2009’. If you do this you will find somewhat different results. The
excluded cases will have a horizontal line through the case number. Also the message ‘Filter On’ will
appear at the bottom right of the Data Editor window in the Status bar (it can be easily overlooked; it is
not shown at the bottom of the Viewer Output window).
Important Note: It is very important afterwards to turn off this selection by revisiting
Data → Select Cases and clicking the All cases button. Then check that the horizontal lines
through the case numbers have all disappeared and that the ‘Filter On’ message has gone from the
Status bar.
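For readers who want to see the arithmetic behind case selection followed by a simple regression, here is a minimal Python sketch. It is not a substitute for the SPSS procedures: the five cases are hypothetical, the dictionary layout is an assumption, and `simple_regression` is an illustrative helper rather than anything from SPSS. It keeps only cases with GCSE_YEAR equal to 2009, then fits a least-squares line predicting USEFUL_SELF from USEFUL_SOCIETY.

```python
def simple_regression(x, y):
    """Least-squares line y = intercept + slope * x, with R squared."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


# Hypothetical cases (illustration only, not taken from the dataset).
cases = [
    {"GCSE_YEAR": 2009, "USEFUL_SOCIETY": 10, "USEFUL_SELF": 15},
    {"GCSE_YEAR": 2009, "USEFUL_SOCIETY": 20, "USEFUL_SELF": 25},
    {"GCSE_YEAR": 2008, "USEFUL_SOCIETY": 35, "USEFUL_SELF": 20},
    {"GCSE_YEAR": 2009, "USEFUL_SOCIETY": 30, "USEFUL_SELF": 40},
    {"GCSE_YEAR": 2009, "USEFUL_SOCIETY": 40, "USEFUL_SELF": 48},
]

# Equivalent in spirit to selecting cases where GCSE_YEAR = 2009.
selected = [c for c in cases if c["GCSE_YEAR"] == 2009]

x = [c["USEFUL_SOCIETY"] for c in selected]
y = [c["USEFUL_SELF"] for c in selected]
intercept, slope, r_squared = simple_regression(x, y)
print(f"USEFUL_SELF = {intercept:.2f} + {slope:.2f} * USEFUL_SOCIETY (R^2 = {r_squared:.3f})")
```

Because the filter builds a new list rather than altering `cases`, there is nothing to switch off afterwards; in SPSS the filter stays active until All cases is chosen again.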
ANOVA
9. Perform a One-way ANOVA to determine whether the variable CONFIDENCE is significantly
different for male students and female students.
10. Perform a One-way ANOVA to determine whether the variable ENJOYMENT is significantly
different for the three levels of teacher support recorded in the variable Teacher_Support_Level, which
has 1 as lowest level and 3 as highest level.
11. Perform a One-way ANOVA to determine whether the variable CONFIDENCE_F is significantly
different for the four levels recorded in the variable Useful_to_Society_Level_F, which has 1 as lowest
level and 4 as highest level.
12. Perform a One-way ANOVA to determine whether the mean of the variable TEACHER_SUPPORT is
significantly different from that for TEACHER_SUPPORT_F.
13. Perform a One-way ANOVA to determine whether the four variables
USEFUL_SELF, USEFUL_SELF_F, USEFUL_SOCIETY and USEFUL_SOCIETY_F have significantly
different means. (When you have finished this task, but not before, be sure to look at 14.)
14. There is a flaw in applying the ANOVA to the variables in 13 because the variables are on two different
scales (0-48) and (0-40). Use Transform → Compute Variable to overcome this problem,
repeat the analysis, and compare your findings.
15. Perform a Two-way ANOVA to determine whether the variable CONFIDENCE is significantly
different for male students and female students, for classes taught by male teachers and those taught by
female teachers, and whether there is an interaction effect.
16. Perform a Two-way ANOVA to determine whether the variable ENJOYMENT is significantly
different for male students and female students, for the four levels recorded in the variable
Useful_to_Society_Level, which has 1 as lowest level and 4 as highest level, and whether there is an
interaction effect.
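One possible way of putting scores from scales of different lengths onto a common footing, as question 14 requires, is to express each score as a percentage of its scale maximum. The Python sketch below is only an illustration of that idea. It assumes that the 12-item USEFUL_SELF scores run from 0 to 48 and the 10-item USEFUL_SOCIETY scores from 0 to 40; the function name `rescale` and the sample scores are hypothetical.

```python
def rescale(score, old_max, new_max=100.0):
    """Map a score on a 0..old_max scale onto a 0..new_max scale."""
    return score * new_max / old_max


# Hypothetical scores for one student (illustration only).
useful_self = 36       # assumed to be on the 0-48 scale
useful_society = 30    # assumed to be on the 0-40 scale

useful_self_pct = rescale(useful_self, 48)
useful_society_pct = rescale(useful_society, 40)
print(useful_self_pct, useful_society_pct)
```

In this example the raw scores differ (36 and 30) but both correspond to the same position on their own scales, which is the kind of distortion the rescaling removes before means are compared.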
KOLMOGOROV-SMIRNOV TEST
17. Use the Kolmogorov-Smirnov test to determine which of the following can be considered to have normal
distributions: CONFIDENCE, TEACHER_SUPPORT, USEFUL_SELF.
18. Use the Kolmogorov-Smirnov test to determine which of the following can be considered to have normal
distributions: CONFIDENCE_F, TEACHER_SUPPORT_F, USEFUL_SELF_F and compare with the
results in 17.
RELIABILITY
19. Use Cronbach’s Alpha method to determine the best possible scale to measure the ‘commitment’ of a
student to mathematics, initially using the following six variables. (It doesn’t work out as one would
expect, despite being very similar to the example in T33.3.)
CAREER_MATHS_EXTENT
HE
HE_MATHS_EXTENT
MATHS_IMPORTANCE
MATHS_POSITION
MATHS_Y12
20. Use Cronbach’s Alpha method to determine the best possible scale to measure a student’s
‘joy-of-maths’, initially using these variables:
21. (a) Starting from the 12 relevant variables within the 36 variables S01 to S36, create a single variable
‘CONSCALE’ as a scale for Confidence-with-mathematics, with ‘0’ as its lowest possible value. You
will need to see which 12 of the 36 variables relate to Confidence. Do this by looking at their labels.
Method 1
This can be done in 3 stages:
Step 1: Make all 12 variables positive either by recoding or by computing to produce 12 new variables
(‘CON01’ to ‘CON12’).
[If using Transform → Recode ... the recoding steps are 1→5, 2→4, 3→3, 4→2, 5→1.]
[If using Transform → Compute Variable the equation is ‘new variable’ = 6 – ‘old variable’.]
Step 2: Use Compute Variable to produce a new variable ‘CONTOTAL’ by adding together the 12
variables resulting from Step 1.
Step 3: Note that with individual scores ranging from 1 to 5, the range of 12 variables added will not
have ‘0’ as its minimum. So a constant must be subtracted from ‘CONTOTAL’ to produce
‘CONSCALE’.
Method 2
The above can be done all in one step using the Compute Variable procedure – but it is not easy to
get right first time. It is wise to write down the equation on paper and test it ‘by hand’ first. Make sure
‘0’ is the minimum achievable.
(b) Use Frequencies or some other procedure to compare the distribution for ‘CONSCALE’ with that of
‘CONFIDENCE’ which is a variable already in the dataset. They should match exactly.
(c) Having created the 12 variables ‘CON01’ to ‘CON12’, perform a Reliability Analysis to calculate
Cronbach’s Alpha reliability score, and determine whether removal of any of the 12 variables would
improve the score.
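The arithmetic behind question 21 can be checked by hand or with a short script. The Python sketch below is a hedged illustration, not the SPSS procedure: it reverses negatively worded items with 6 minus the response, sums the items, subtracts 1 per item so that the minimum is 0, and computes Cronbach's Alpha with the standard formula k/(k-1) × (1 − sum of item variances / variance of totals). Which items are negatively worded is an assumption passed in by the caller, and all function names and data are hypothetical.

```python
from statistics import variance


def reverse_score(value):
    """Turn a 1-5 response round: 1<->5, 2<->4, 3 stays 3."""
    return 6 - value


def scale_score(responses, negative_items):
    """Sum 1-5 responses after reversing the negatively worded items
    (given as zero-based positions), shifted so the minimum is 0."""
    positive = [reverse_score(v) if i in negative_items else v
                for i, v in enumerate(responses)]
    total = sum(positive)           # plays the role of CONTOTAL
    return total - len(positive)    # plays the role of CONSCALE


def cronbach_alpha(rows):
    """rows: one list of k positively scored items per respondent."""
    k = len(rows[0])
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses from four students to three items.
rows = [[1, 2, 1], [3, 3, 4], [5, 4, 5], [2, 2, 2]]
alpha = cronbach_alpha(rows)
print(f"alpha = {alpha:.3f}")

# With 12 items the scale runs from 0 (all 1s) to 48 (all 5s).
lowest = scale_score([1] * 12, negative_items=set())
highest = scale_score([5] * 12, negative_items=set())
print(lowest, highest)
```

To mimic the ‘alpha if item deleted’ check, one could call `cronbach_alpha` again with one column removed from every row and compare the result with the full-scale value.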
FACTOR ANALYSIS
22. Perform a Factor Analysis on the ten variables Usoc01 to Usoc10 in the Initial Questionnaire. Discuss
your findings. (You should find three components.) (See 23 for related analysis.)
23. Perform a Factor Analysis on the ten variables Usoc01F to Usoc10F in the Final Questionnaire (the
questions were identical to those for Usoc01 to Usoc10 analysed in 22). Discuss your findings and
compare with those from 22. (You should find two components.)
24. Perform a Factor Analysis on the 22 variables Uself01F to Uself12F and Usoc01F to Usoc10F which
purport to cover the two aspects of usefulness – for oneself and for society. Discuss your findings. (You
should find three components.)
This data file contains data on IT piracy for 109 countries, published by Business Software Alliance,
who define the IT piracy rate as the percentage of all software in use which is pirated. See the end of
this Guide for details of their annual reports which explain the methodology and provide the raw data.
Cases: 109
Variables: 21: Country, Region of the world, Region of the world (as used by BSA), Population (2008,
2009, 2010), GDP (2008, 2009, 2010), IT piracy rates for 2005 to 2010, IT piracy values
(US$ millions) for 2005 to 2010.
Data sources: Business Software Alliance, The World Bank, The IMF.
QUESTIONS
1. Which region had the highest (a) piracy rates and (b) piracy values in each of the years 2005 to 2010?
2. Which 10 countries have the highest (a) piracy rates and (b) highest piracy values in 2010?
3. For Q2 are the 10 countries the same for (a) and (b)? If not why do you think there is a difference?
4. Which 10 countries have the lowest (a) piracy rates and (b) piracy values in 2010?
5. For Q4 are the 10 countries the same for (a) and (b)? If not why do you think there is a difference?
6. (a) How strongly is piracy rate correlated with population size and with GDP?
(b) What is the correlation with population when the effect of GDP is removed? (i.e. find the partial
correlation).
(c) What is the correlation with GDP when the effect of population is removed? (i.e. find the partial
correlation).
7. (a) How strongly is piracy value correlated with population size and with GDP?
(b) What is its partial correlation with population when the effect of GDP is removed?
(c) What is its partial correlation with GDP when the effect of population is removed?
8. Do your answers to Q6 and Q7 show the same trends? What is your explanation?
9. Did the piracy rates increase or decrease over the period 2005-2010? (Hint: draw a graph of the
countries with the highest piracy rates for 2010 over the period 2005-2010, and another graph for
countries with the lowest piracy rates).
10. Do the same as in Q9 for piracy values. Do these show the same trends? What is your explanation?
12. Use the Chi-square test to compare the mean piracy rates by region.
13. Calculate the piracy value per capita for each country, and compare the mean by region.
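Questions 6 and 7 ask for partial correlations. As a hedged illustration of what SPSS computes, the Python sketch below uses the standard first-order formula r_xy.z = (r_xy − r_xz·r_yz) / sqrt((1 − r_xz²)(1 − r_yz²)). The four-country data are hypothetical and the function names are assumptions for this example only.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation of x and y with the effect of z removed."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical values for four countries (illustration only).
piracy_rate = [85, 70, 45, 30]
population = [10, 50, 20, 80]
gdp = [100, 300, 900, 1500]

r_rate_pop = pearson(piracy_rate, population)
r_rate_gdp = pearson(piracy_rate, gdp)
r_pop_gdp = pearson(population, gdp)

# Correlation of piracy rate with population, controlling for GDP.
r_partial = partial_correlation(r_rate_pop, r_rate_gdp, r_pop_gdp)
print(f"r = {r_rate_pop:.3f}, partial r = {r_partial:.3f}")
```

Comparing the plain and partial coefficients shows how much of an apparent relationship is carried by the controlled variable.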
This data file contains statistics for 157 of the world’s largest countries (population at least 1 million in
2011) for whom the information is available, provided by Internet World Stats (IWS). See the end of
this Guide for links to the IWS reports and datasets.
Cases: 157
Variables: 6: Country, region of world, population mid-2011 (estimate March 2011), GDP in purchasing
power parity (PPP) per capita per annum in international dollars (estimates at April 2011),
number of Facebook users (June 2011).
QUESTIONS
1. Calculate the Facebook penetration rate for each country, defined as:
2. What is the correlation between population size and penetration? What do you conclude?
4. Does the region affect the Facebook penetration rate? (Hint: Perform a suitable test to compare mean
penetration rates for the regions.)
5. Does the size of a country’s population affect the Facebook penetration rate? (Hint: Assign the countries
to different size categories, and perform a suitable test to compare mean penetration rates for the
categories.)
This data file contains statistics on internet usage for 31 European countries, provided by Europa. See the
end of this Guide for links to the Europa report and datasets.
Cases: 31
Variables: 9: Country, region within Europe, GNI (Gross National Income) in purchasing power parity
(PPP) per capita per annum in US$ 2009, Internet usage rate by 16-24 year-olds in 2009,
Internet usage rate by 16-74 year-olds in 2009, Internet access rate in 2007 and in 2009,
Internet buyer rate for males and for females 2009.
QUESTIONS
1. Calculate for each country the Internet penetration rate for 2007 and for 2009, defined as:
2. Are there significant regional differences in the Internet penetration rates found in Q1? If so, is there an
explanation?
3. Calculate for each country the internet user growth rate from 2007 to 2009, defined as:
4. Are there significant regional differences in the Internet user growth rates found in Q3? If so, is there an
explanation?
5. Is there a significant correlation between population size and number of internet users? Investigate
separately for each year. What is the explanation of your findings?
6. Is there a significant correlation between population size and internet user growth rate? What is the
explanation of your findings?
7. Is there a significant correlation between GNI and number of internet users? Investigate separately for
each year. What is the explanation of your findings?
8. Is there a significant correlation between GNI and Internet penetration rate? Investigate separately for
each year. What is the explanation of your findings?
9. Is there a significant correlation between GNI and internet user growth rate? What is the explanation of
your findings?
10. Is there a significant correlation between GNI and Internet penetration rate? What is the explanation of
your findings?
11. Are there differences in usage rates for younger people and the whole population? What is the
explanation of your findings?
[Note: the data for 16-24 year-olds are included in the data for 16-74 year-olds, and unfortunately cannot be
separated out as only rates and not numbers are provided. It could be done approximately if the
percentage of 16-24 year-olds in the population of 16-74 year-olds were found.]
12. Are there regional differences in Q11? What is the explanation of your findings?
13. Are there differences in usage rates for males and females? What is the explanation of your findings?
14. Are there regional differences in Q13? What is the explanation of your findings?
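The approximate separation mentioned in the note to question 11 follows from treating the overall rate as a weighted average: rate(16-74) = p × rate(16-24) + (1 − p) × rate(25-74), where p is the share of 16-24 year-olds among 16-74 year-olds. Rearranging gives the rate for 25-74 year-olds. The Python sketch below illustrates this; the share p is not in the dataset and would have to be found elsewhere, and the numbers and function name are hypothetical.

```python
def rate_excluding_subgroup(rate_all, rate_sub, share_sub):
    """Rate for the rest of a population once a subgroup is removed.

    rate_all  : rate for the whole group (e.g. 16-74 year-olds)
    rate_sub  : rate for the subgroup (e.g. 16-24 year-olds)
    share_sub : subgroup's share of the whole group, as a fraction
    """
    if not 0 <= share_sub < 1:
        raise ValueError("share_sub must be at least 0 and less than 1")
    return (rate_all - share_sub * rate_sub) / (1 - share_sub)


# Hypothetical figures for one country (illustration only).
rate_16_74 = 70.0   # percent
rate_16_24 = 95.0   # percent
share_16_24 = 0.15  # assumed share of 16-24 year-olds among 16-74 year-olds

rate_25_74 = rate_excluding_subgroup(rate_16_74, rate_16_24, share_16_24)
print(f"Estimated usage rate for 25-74 year-olds: {rate_25_74:.1f}%")
```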
This data file contains demographic statistics for 155 of the world’s largest countries (population at least 1
million in 2010) for whom at least some of the information is available, provided by UNESCO Institute for
Statistics, The World Bank and Internet World Stats (IWS). See the end of this Guide for links to their reports
and datasets.
Cases: 155
Variables: 15: Country, region of the world, subregion, land area, adult literacy rate (latest data),
population (2008, 2009, 2010), GNI per capita (2008, 2009, 2010) in US$, GDP total (2008,
2009, 2010) in US$ millions.
Data sources: UNESCO Institute for Statistics, World Bank, Internet World Stats.
QUESTIONS
LITERACY
2. Is there a significant correlation between Literacy rate and GNI per capita? If so, what is the
explanation?
3. Is there a significant correlation between Literacy rate and GDP total? What is the explanation?
4. What is the average person’s Literacy rate? (NB This is not the average of the rates for countries.)
5. Use the variables Region, GNI per capita and GDP to find the regression equation to predict a
country’s literacy rate.
6. Calculate for each country the GDP per capita for each year (2008, 2009, 2010) and, using correlation,
compare GDP per capita with GNI per capita. What do you expect to find? What is the explanation of
what you actually find?
7. Does GNI per capita vary significantly between the regions of the world?
8. What is the average person’s GNI per capita? (NB This is not the average of the rates for countries.)
9. Comparing the average GNI per capita for each year, is there a trend? What is the explanation?
10. Comparing the total GDP for each year, is there a trend? What is the explanation? Compare with Q9.
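Questions 4 and 8 distinguish the average person's value from the average of the country values. The difference is that each country must be weighted by its population. The Python sketch below shows the idea with two hypothetical countries; the function name `weighted_mean` and the figures are assumptions for illustration only.

```python
def weighted_mean(values, weights):
    """Mean of values in which each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Two hypothetical countries (illustration only).
literacy_rates = [99.0, 60.0]     # percent
populations = [10.0, 90.0]        # millions

country_average = sum(literacy_rates) / len(literacy_rates)
person_average = weighted_mean(literacy_rates, populations)
print(f"Average of country rates: {country_average:.1f}%")
print(f"Average person's rate:    {person_average:.1f}%")
```

Here the small, highly literate country pulls the unweighted average up, while the population-weighted figure reflects the situation of the typical person.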
POPULATION
11. Calculate for each country the average annual population growth rate from 2008 to 2010, defined as:
12. Are there significant regional differences in their population growth rates? If so, is there an explanation?
13. Is there a significant correlation between population size (based on 2008) and population growth rate?
What is the explanation of your findings?
14. Are there significant differences in the population growth rates depending on the size of the country? If
so, is there an explanation?
15. Estimate for the world the annual population growth rate from 2008 to 2010. What factors may limit the
accuracy (validity) of this calculation?
This data file contains internet user statistics for the seven regions of the world for the Years 2000 and 2011.
Cases: 7
Variables: 3: Region of the World, Population of the Region in Year 2011 (millions), Number of Internet
Users in the Region in Year 2011 (millions).
Data sources
We gratefully acknowledge permission to use data from the following sources.
Population of Countries
The World Bank. Data > Indicators > Population, total.
<http://data.worldbank.org/indicator/SP.POP.TOTL>, [2011], [accessed 04.08.11].
International Monetary Fund. Data and Statistics > Data > World Economic Outlook Databases (WEO)
> By Countries > 1 Select Country Group > 2 Select Country > 3 Select Subjects > People >
Population > 4 Select Date Range > Prepare Report. <http://www.imf.org/external/index.htm>, [25 July
2011], [accessed 04.08.11].
Gross National Income (GNI) and Gross Domestic Product (GDP) for Countries
The World Bank. Data > Indicators > GDP (current US$).
<http://data.worldbank.org/indicator/NY.GDP.MKTP>, [2011], [accessed 04.08.11].
The World Bank. Data > Indicators > GNI per capita, Atlas method (current US$).
<http://data.worldbank.org/indicator/NY.GNP.PCAP.CD>, [2011], [accessed 04.08.11].
International Monetary Fund. Data and Statistics > Data > World Economic Outlook Databases (WEO)
> By Countries > 1 Select Country Group > 2 Select Country > 3 Select Subjects > National Accounts
> Gross domestic product per capita, current prices U.S. dollars > 4 Select Date Range > Prepare
Report. <http://www.imf.org/external/index.htm>, [25 July 2011], [accessed 04.08.11].
International Monetary Fund. Data and Statistics > Data > World Economic Outlook Databases (WEO)
> By Countries > 1 Select Country Group > 2 Select Country > 3 Select Subjects > National Accounts
> Gross domestic product, current prices U.S. dollars > 4 Select Date Range > Prepare Report.
<http://www.imf.org/external/index.htm>, [25 July 2011], [accessed 04.08.11].
Regional and individual country data are available. Updates occur regularly:
Americas: <http://www.internetworldstats.com/stats2.htm>
Language used: <http://www.internetworldstats.com/stats7.htm>
European Union: <http://www.internetworldstats.com/stats9.htm>
Caribbean: <http://www.internetworldstats.com/stats11.htm>
Central America: <http://www.internetworldstats.com/stats12.htm>
Spanish Speaking: <http://www.internetworldstats.com/stats13.htm>
South America: <http://www.internetworldstats.com/stats15.htm>
Business Software Alliance. Seventh Annual BSA/IDC Global Software Piracy Study May 2010 – Study in
Brief. <http://portal.bsa.org/globalpiracy2009/studies/09%20Piracy_In%20Brief_A4_111010.pdf>, [May
2010], [accessed 04.08.11].
This has annual data for 2005-2009.
On the date accessed (04.08.11) the above led to the following (updates occur regularly):
Note: Many developed countries no longer collect or publish literacy rate data. For these a rate of 99.0% is
assumed.
Additional sources
In a few cases GDP, population and literacy data could not be found from the above sources, for which
recourse was made elsewhere:
There are many disputed territories and very small ‘nations’. These have generally been omitted from our
datasets. For political reasons the UN is not allowed to refer to Taiwan (despite being autonomous with a
population nearing 24 million!). Data about Taiwan can be hard to come by.
Interestingly, Hong Kong and the much smaller Macao are usually reported separately from China (being
SARs – Special Administrative Regions), whereas Taiwan is reported by The International Monetary Fund
as ‘Taiwan, Province of China’ and simply omitted from tables by The United Nations and The World Bank.
For an interesting discussion of what constitutes a country, territory, colony, dependency or other nation
group see:
http://geography.about.com/countries/a/numbercountries.htm