Chapter 1: Hello Data!
Study Theme 1: Introduction to Data & Exploratory Data Analysis
Dr Rene Stander
STC 122
2025
Dr Rene Stander (STC 122) Chapter 1: Hello Data! 2025 1 / 23
Super Animals
Let’s explore a dataset on animals across different ecosystems.
Import the dataset
animals <- read.csv("super-animals.csv")
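After importing, it is worth checking that the file was read as expected. The sketch below is self-contained: it writes a small made-up file (the object names toy_file and toy, and the three rows of values, are assumptions for illustration and not the real super-animals.csv) and reads it back with the same read.csv() call.

```r
# Hypothetical toy data, written to a temporary file so the example runs anywhere
toy_file <- tempfile(fileext = ".csv")
writeLines(c("Animal,Species,Speed",
             "Secretarybird,Bird,65",
             "Lion,Mammal,80",
             "Ostrich,Bird,70"),
           toy_file)

toy <- read.csv(toy_file)   # same kind of call as for super-animals.csv
head(toy, 2)                # show the first two rows
```

With the real file, head(animals) would show the first six rows in the same way.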
Structure of data
Data is a collection of information.
Typically organised in rows and columns (a data frame or table).
Each row: an observation or case.
Each column: a variable (a characteristic or attribute).
Tidy data: each row is a unique case (observational unit), each
column is a variable, and each cell is a single value.
Data in R
Dimensions of the dataset
dim(animals)
## [1] 108 10
This dataset has 108 observations (rows) and 10 variables (columns).
Extract the Speed variable and store it in an object called
animal_speed
animal_speed <- animals$Speed
How would you get only the number of rows? And only the number of
columns?
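One possible answer, sketched on a small made-up data frame (toy and its values are hypothetical stand-ins for animals): base R provides nrow() and ncol(), and the same numbers can be picked out of dim() by position.

```r
# Hypothetical stand-in for the animals data frame (3 rows, 2 columns)
toy <- data.frame(Animal = c("Secretarybird", "Lion", "Ostrich"),
                  Speed  = c(65, 80, 70))

dim(toy)      # rows and columns together
nrow(toy)     # number of rows only
ncol(toy)     # number of columns only
dim(toy)[1]   # same as nrow(toy): the first element of dim()
dim(toy)[2]   # same as ncol(toy): the second element of dim()
```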
Types of variables
Variables can either be numerical (quantitative) or categorical
(qualitative).
Types of variables (Examples)
Numerical: Discrete
Variable              Example values
Number of pets        1, 2, 0
Goals scored          0, 1, 4, 7
Number of children    0, 1, 2, 3
Numerical: Continuous
Variable              Example values
Height                172.5, 160.0, 185.3
Temperature           23.4, 18.0, 31.2
Blood pressure        120.5, 130.2, 110.8
Types of variables (Examples)
Categorical: Nominal
Variable              Example values
Gender                Male, Female
Blood type            A, B, AB, O
Eye colour            Blue, Green, Brown
Categorical: Ordinal
Variable              Example values
Education level       High school < Bachelor’s < Master’s
Satisfaction rating   Very dissatisfied < Neutral < Very satisfied
Pain severity         Mild < Moderate < Severe
Variables in the Super Animals dataset
Preview variables in the dataset
str(animals)
## 'data.frame': 108 obs. of 10 variables:
## $ Number : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Animal : chr "White headed vulture" "Secretarybird" "Ku
## $ Category : chr "Savanna" "Savanna" "Savanna" "Savanna" ..
## $ Species : chr "Bird" "Bird" "Mammal" "Mammal" ...
## $ Age : num 20 12 15 20 25 20 50 25 3 15 ...
## $ Weight : num 4.7 4 227 250 828 54 1000 190 0.02 0.1 ...
## $ Size : num 85 152 240 240 505 162 380 250 3 38 ...
## $ Speed : num 48 65 70 80 56 80 55 80 30 54 ...
## $ Vulnerability: int 1 2 4 4 4 4 1 2 4 4 ...
## $ Updated : chr "No" "Yes" "Yes" "No" ...
Variables in the Super Animals dataset
Get a list of the variable names
names(animals)
## [1] "Number" "Animal" "Category" "Species"
## [5] "Age" "Weight" "Size" "Speed"
## [9] "Vulnerability" "Updated"
Variables in the Super Animals dataset
Let’s classify each variable in the dataset based on its type:
Category: Categorical Nominal
Species: Categorical Nominal
Age: Numerical Continuous
Weight: Numerical Continuous
Size: Numerical Continuous
Speed: Numerical Continuous
Vulnerability: Categorical Ordinal
Variable types in R
In R, variables (columns) can have different types. Common types include:
numeric: Real numbers (e.g., 3.14, 42)
integer: Whole numbers (e.g., 1L, 5L)
character: Text or strings
factor: Categorical variables with levels
logical: TRUE or FALSE
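As a quick illustration of the list above, the sketch below creates one object of each type and asks R for its class. The object names (x_num, x_int and so on) are made up for this example.

```r
x_num <- 3.14                                  # numeric
x_int <- 5L                                    # integer (note the L suffix)
x_chr <- "Bird"                                # character
x_fac <- factor(c("Bird", "Mammal", "Bird"))   # factor with two levels
x_lgl <- TRUE                                  # logical

class(x_num)   # "numeric"
class(x_int)   # "integer"
class(x_chr)   # "character"
class(x_fac)   # "factor"
class(x_lgl)   # "logical"
```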
Variable types in R
class(): Describes how R interprets the object (e.g., “data.frame”,
“factor”, “numeric”)
class(animals$Weight)
## [1] "numeric"
typeof(): Describes the internal storage type (e.g., “double”,
“integer”, “character”)
typeof(animals$Weight)
## [1] "double"
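The difference between class() and typeof() is easiest to see on a factor, where the two give different answers. A minimal sketch, using a made-up vector f:

```r
f <- factor(c("Bird", "Mammal", "Bird"))

class(f)     # "factor": how R treats the object
typeof(f)    # "integer": factors are stored internally as integer codes

class(42)    # "numeric"
typeof(42)   # "double": a number without the L suffix is stored as a double
```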
Changing the variable type in R
In the super animals dataset, Vulnerability is a categorical (ordinal)
variable.
If we look at the first 10 entries of the Vulnerability variable, we see
that the values are integer codes (the variable takes the values 1, 2, 3 and 4).
animals$Vulnerability[1:10]
## [1] 1 2 4 4 4 4 1 2 4 4
R currently treats this variable as an integer.
class(animals$Vulnerability)
## [1] "integer"
Changing the variable type in R
We need to change how R sees and handles this variable. Change the
variable to an ordered factor with the following levels:
1 = Endangered (EN)
2 = Vulnerable (VU)
3 = Near Threatened (NT)
4 = Least Concern (LC)
animals$Vulnerability <- factor(animals$Vulnerability,
                                levels = c(1, 2, 3, 4),
                                labels = c("EN", "VU",
                                           "NT", "LC"),
                                ordered = TRUE)
Changing the variable type in R
class(animals$Vulnerability)
## [1] "ordered" "factor"
levels(animals$Vulnerability)
## [1] "EN" "VU" "NT" "LC"
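Two further checks can be useful after such a conversion: counting the observations per level and confirming that the ordering behaves as intended. The sketch below applies the same factor() call to the 10 Vulnerability values printed earlier; the object names v and v_fac are made up for this example.

```r
# The first 10 Vulnerability values shown earlier
v <- c(1, 2, 4, 4, 4, 4, 1, 2, 4, 4)

v_fac <- factor(v,
                levels = c(1, 2, 3, 4),
                labels = c("EN", "VU", "NT", "LC"),
                ordered = TRUE)

table(v_fac)          # counts per level; NT is listed with a count of 0
v_fac[1] < v_fac[3]   # TRUE: EN comes before LC in the ordering
anyNA(v_fac)          # FALSE: every code matched one of the levels
```

A value that is not listed in levels would become NA, which is why the anyNA() check is worth running on the full variable.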
Changing the variable type in R
The variables Category and Species are also categorical variables, but
their categories have no natural order. We therefore convert them to
(unordered) factors.
animals$Category <- factor(animals$Category)
animals$Species <- factor(animals$Species)
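To see what factor() does to a character vector, the sketch below converts the first five Species values shown in this chapter. The names sp and sp_fac are made up for this example.

```r
sp <- c("Bird", "Bird", "Mammal", "Mammal", "Mammal")
sp_fac <- factor(sp)

levels(sp_fac)       # "Bird" "Mammal": levels are sorted alphabetically by default
as.integer(sp_fac)   # 1 1 2 2 2: the integer code of each value's level
summary(sp_fac)      # number of observations per level
```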
str(animals)
## 'data.frame': 108 obs. of 10 variables:
## $ Number : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Animal : chr "White headed vulture" "Secretarybird" "Ku
## $ Category : Factor w/ 9 levels "Coast","Forest",..: 6 6 6 6
## $ Species : Factor w/ 7 levels "Amphibian","Bird",..: 2 2 6
## $ Age : num 20 12 15 20 25 20 50 25 3 15 ...
## $ Weight : num 4.7 4 227 250 828 54 1000 190 0.02 0.1 ...
## $ Size : num 85 152 240 240 505 162 380 250 3 38 ...
## $ Speed : num 48 65 70 80 56 80 55 80 30 54 ...
## $ Vulnerability: Ord.factor w/ 4 levels "EN"<"VU"<"NT"<..: 1 2 4
## $ Updated : chr "No" "Yes" "Yes" "No" ...
Subset data based on characteristics
Extract the observations of the birds from the data.
1. The type of species is indicated in the Species variable. First, have a
look at what is included in this variable.
# Extract the first 5 values
animals$Species[1:5]
## [1] Bird Bird Mammal Mammal Mammal
## Levels: Amphibian Bird Crustacean Fish Insect Mammal Reptile
# Display the different categories
# within the variable
levels(animals$Species)
## [1] "Amphibian" "Bird" "Crustacean" "Fish" "Insect"
## [6] "Mammal" "Reptile"
Subset data based on characteristics
2. Subset the animals dataset based on the Species variable. We want
to save all the observations where the species is indicated as “Bird”.
birds <- subset(animals, Species == "Bird")
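The same kind of subset can also be written with bracket indexing. The sketch below compares the two approaches on a small made-up data frame (toy and its values are hypothetical) and shows that the subset still carries the unused factor levels until droplevels() is applied.

```r
# Hypothetical stand-in for the animals data frame
toy <- data.frame(Animal  = c("Secretarybird", "Lion", "Ostrich"),
                  Species = factor(c("Bird", "Mammal", "Bird")),
                  Speed   = c(65, 80, 70))

birds_a <- subset(toy, Species == "Bird")   # subset(), as used above
birds_b <- toy[toy$Species == "Bird", ]     # bracket indexing: rows, then all columns

nrow(birds_a)                          # 2 birds in the toy data
all.equal(birds_a, birds_b)            # both approaches select the same rows
levels(birds_a$Species)                # "Bird" "Mammal": unused level is kept
levels(droplevels(birds_a)$Species)    # "Bird": unused level removed
```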
Association vs. Causation
Association: A relationship exists between two variables.
▶ Example: Ice cream sales and drowning incidents both rise in hot
weather, but neither causes the other.
Causation: One variable directly affects the other.
▶ Example: Taking antibiotics reduces bacterial infections.
Correlation does not imply causation!
Why Does This Matter?
Mistaking association for causation can lead to false conclusions.
Exploratory Data Analysis (EDA) finds patterns, not proof.
Even if two variables move together, it doesn’t mean one causes the
other.
Example: county dataset
Scatterplots are one type of graph used to study the relationship
between two numerical variables.
When two variables show some connection with one another, they are
called associated variables.
If two variables are not associated, they are said to be
independent, meaning there is no visible relationship between them.
[Figure: two scatterplots from the county dataset.
Negative association: homeownership rate (y) against percent of housing units that are multi-unit structures (x).
Positive association: population change over 7 years (y) against median household income (x).]
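A scatterplot like the ones described above can be drawn with plot() in base R. The sketch below uses five made-up values (not taken from the county dataset), and the variable names are assumptions for illustration. cor() is used here only as a quick numerical check of the direction of the association.

```r
# Hypothetical values for five counties (not from the county dataset)
multi_unit    <- c(5, 15, 30, 50, 70)    # percent of housing units in multi-unit structures
homeownership <- c(80, 75, 68, 55, 40)   # homeownership rate (percent)

# Draw the scatterplot when running interactively (e.g. in RStudio)
if (interactive()) {
  plot(multi_unit, homeownership,
       xlab = "Percent of housing units that are multi-unit structures",
       ylab = "Homeownership rate")
}

cor(multi_unit, homeownership)   # a negative value indicates a negative association
```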
Explanatory and response variables
When we ask questions about the relationship between two variables, we
sometimes also want to determine if the change in one variable causes a
change in the other.
If there is an increase in the median household income in a county, does this
drive an increase in its population?
In this question, we are asking whether one variable affects another:
Explanatory variable: Median household income
Response variable: Population change
Explanatory and response variables
When we suspect one variable might causally affect another, we label the
first variable the explanatory variable and the second the response variable.
We also use the terms explanatory and response to describe variables where
the response might be predicted using the explanatory variable, even if there
is no causal relationship.
Explanatory variable -> might affect -> response variable