
Chapter 1: Hello Data!

Study Theme 1: Introduction to Data & Exploratory Data Analysis

Dr Rene Stander

STC 122

2025

Dr Rene Stander (STC 122) Chapter 1: Hello Data! 2025 1 / 23


Super Animals
Let’s explore a dataset on animals across different ecosystems.

Import the dataset


animals <- read.csv("super-animals.csv")



Structure of data
Data is a collection of information, typically organised in rows and columns (a data frame or table).
Each row: an observation or case.
Each column: a variable (a characteristic or attribute).
Tidy data: each row is a unique case (observational unit), each column is a variable, and each cell is a single value.
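As a minimal sketch of this layout, the block below builds a small tidy data frame by hand. The data frame `toy` and all of its values are hypothetical and are not taken from the super-animals file.

```r
# Hypothetical toy data: one row per animal, one column per variable,
# and a single value in each cell.
toy <- data.frame(
  Animal  = c("Ostrich", "Lion", "Crocodile"),
  Species = c("Bird", "Mammal", "Reptile"),
  Speed   = c(70, 80, 35)
)

toy          # the whole table
toy[2, ]     # one row: a single observation (the second animal)
toy$Speed    # one column: a single variable
```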



Data in R

Dimensions of the dataset


dim(animals)

## [1] 108 10

This dataset has 108 observations (rows) and 10 variables (columns).

Extract the Speed variable and save it into a variable called animal_speed

animal_speed <- animals$Speed

How would you get only the number of rows? And only the number of columns?
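One possible answer to the question above uses the base R functions `nrow()` and `ncol()`, or indexes the result of `dim()`. The sketch runs on a hypothetical data frame `toy`; the same calls should work on `animals` once it has been imported.

```r
# Hypothetical toy data frame standing in for the animals dataset
toy <- data.frame(
  Animal = c("Ostrich", "Lion", "Crocodile"),
  Speed  = c(70, 80, 35)
)

dim(toy)       # rows and columns together: 3 2
nrow(toy)      # number of rows only: 3
ncol(toy)      # number of columns only: 2
dim(toy)[1]    # first element of dim(): the rows again
dim(toy)[2]    # second element of dim(): the columns again
```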



Types of variables

Variables can be either numerical (quantitative) or categorical (qualitative).



Types of variables (Examples)
Numerical: Discrete

Variable             Example values
Number of pets       1, 2, 0
Goals scored         0, 1, 4, 7
Number of children   0, 1, 2, 3

Numerical: Continuous

Variable             Example values
Height               172.5, 160.0, 185.3
Temperature          23.4, 18.0, 31.2
Blood pressure       120.5, 130.2, 110.8

Categorical: Nominal

Variable              Example values
Gender                Male, Female
Blood type            A, B, AB, O
Eye colour            Blue, Green, Brown

Categorical: Ordinal

Variable              Example values
Education level       High school < Bachelor’s < Master’s
Satisfaction rating   Very dissatisfied < Neutral < Very satisfied
Pain severity         Mild < Moderate < Severe



Variables in the Super Animals dataset

Preview variables in the dataset


str(animals)

## 'data.frame': 108 obs. of 10 variables:
## $ Number : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Animal : chr "White headed vulture" "Secretarybird" "Ku
## $ Category : chr "Savanna" "Savanna" "Savanna" "Savanna" ..
## $ Species : chr "Bird" "Bird" "Mammal" "Mammal" ...
## $ Age : num 20 12 15 20 25 20 50 25 3 15 ...
## $ Weight : num 4.7 4 227 250 828 54 1000 190 0.02 0.1 ...
## $ Size : num 85 152 240 240 505 162 380 250 3 38 ...
## $ Speed : num 48 65 70 80 56 80 55 80 30 54 ...
## $ Vulnerability: int 1 2 4 4 4 4 1 2 4 4 ...
## $ Updated : chr "No" "Yes" "Yes" "No" ...


Get a list of the variable names


names(animals)

## [1] "Number" "Animal" "Category" "Species"
## [5] "Age" "Weight" "Size" "Speed"
## [9] "Vulnerability" "Updated"


Let’s classify each variable in the dataset based on its type:


Category: Categorical Nominal
Species: Categorical Nominal
Age: Numerical Continuous
Weight: Numerical Continuous
Size: Numerical Continuous
Speed: Numerical Continuous
Vulnerability: Categorical Ordinal



Variable types in R

In R, variables (columns) can have different types. Common types include:


numeric: Real numbers (e.g., 3.14, 42)
integer: Whole numbers (e.g., 1L, 5L)
character: Text or strings
factor: Categorical variables with levels
logical: TRUE or FALSE
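The types listed above can be checked with a short sketch like the following. The object names (`x_num`, `x_int` and so on) are hypothetical and chosen only for illustration.

```r
x_num <- 3.14                                  # numeric
x_int <- 5L                                    # integer (the L suffix requests an integer)
x_chr <- "Savanna"                             # character
x_fct <- factor(c("Bird", "Mammal", "Bird"))   # factor with levels Bird and Mammal
x_lgl <- TRUE                                  # logical

class(x_num)   # "numeric"
class(x_int)   # "integer"
class(x_chr)   # "character"
class(x_fct)   # "factor"
class(x_lgl)   # "logical"

# A whole number typed without the L suffix is still numeric
class(42)      # "numeric"
```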


class(): Describes how R interprets the object (e.g., “data.frame”, “factor”, “numeric”)

class(animals$Weight)

## [1] "numeric"

typeof(): Describes the internal storage type (e.g., “double”, “integer”, “character”)

typeof(animals$Weight)

## [1] "double"
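The difference between `class()` and `typeof()` is easiest to see on a factor, where the two disagree. This is a small sketch on hypothetical vectors, not on the animals data.

```r
# A factor is interpreted as categorical data but stored as integer codes
species <- factor(c("Bird", "Mammal", "Bird"))
class(species)    # "factor"
typeof(species)   # "integer"

# A vector of decimal numbers is numeric and stored as double
speed <- c(48, 65, 70)
class(speed)      # "numeric"
typeof(speed)     # "double"
```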



Changing the variable type in R

In the super animals dataset, Vulnerability is a categorical (ordinal) variable.
If we look at the first 10 entries of the Vulnerability variable, we see that it is stored as numeric codes. The codes used in the dataset are 1, 2, 3 and 4.

animals$Vulnerability[1:10]

## [1] 1 2 4 4 4 4 1 2 4 4

R treats this variable as an integer.

class(animals$Vulnerability)

## [1] "integer"


We need to change how R sees and handles this variable. Change the variable to an ordered factor with the following levels:
1 = Endangered (EN)
2 = Vulnerable (VU)
3 = Near Threatened (NT)
4 = Least Concern (LC)

animals$Vulnerability <- factor(animals$Vulnerability,
                                levels = c(1, 2, 3, 4),
                                labels = c("EN", "VU", "NT", "LC"),
                                ordered = TRUE)


class(animals$Vulnerability)

## [1] "ordered" "factor"


levels(animals$Vulnerability)

## [1] "EN" "VU" "NT" "LC"
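What an ordered factor adds can be sketched as follows: because the levels have an order, comparisons such as `<` become meaningful. The vector `codes` is hypothetical; the levels and labels follow the coding used in this chapter.

```r
# Hypothetical vulnerability codes for five animals
codes <- c(1, 2, 4, 4, 3)

vuln <- factor(codes,
               levels  = c(1, 2, 3, 4),
               labels  = c("EN", "VU", "NT", "LC"),
               ordered = TRUE)

vuln            # values shown with Levels: EN < VU < NT < LC
vuln < "NT"     # TRUE TRUE FALSE FALSE FALSE
max(vuln)       # LC, the highest level present
table(vuln)     # number of animals in each level
```

With an unordered factor the comparison `vuln < "NT"` would not be meaningful, which is one reason to set `ordered = TRUE` for ordinal variables.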


The variables Category and Species are also categorical variables. Therefore, we need to change the variable type to factor.

animals$Category <- factor(animals$Category)
animals$Species <- factor(animals$Species)
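A small sketch of what this conversion does, using a hypothetical character vector `category` rather than the real column: `factor()` collects the distinct values as levels, sorted alphabetically by default, and stores each entry as an integer code pointing at its level.

```r
# Hypothetical character values
category <- c("Savanna", "Forest", "Coast", "Savanna")
class(category)        # "character"

category <- factor(category)
class(category)        # "factor"
levels(category)       # "Coast" "Forest" "Savanna"
as.integer(category)   # 3 2 1 3: position of each value among the levels
table(category)        # counts per level
```

These integer codes are what `str()` prints after the level names when it summarises a factor column.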



str(animals)

## 'data.frame': 108 obs. of 10 variables:
## $ Number : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Animal : chr "White headed vulture" "Secretarybird" "Ku
## $ Category : Factor w/ 9 levels "Coast","Forest",..: 6 6 6 6
## $ Species : Factor w/ 7 levels "Amphibian","Bird",..: 2 2 6
## $ Age : num 20 12 15 20 25 20 50 25 3 15 ...
## $ Weight : num 4.7 4 227 250 828 54 1000 190 0.02 0.1 ...
## $ Size : num 85 152 240 240 505 162 380 250 3 38 ...
## $ Speed : num 48 65 70 80 56 80 55 80 30 54 ...
## $ Vulnerability: Ord.factor w/ 4 levels "EN"<"VU"<"NT"<..: 1 2 4
## $ Updated : chr "No" "Yes" "Yes" "No" ...



Subset data based on characteristics

Extract the observations of the birds from the data.

1. The type of species is indicated in the Species variable. First, have a look at what is included in this variable.

# Extract the first 5 values
animals$Species[1:5]

## [1] Bird Bird Mammal Mammal Mammal
## Levels: Amphibian Bird Crustacean Fish Insect Mammal Reptile

# Display the different categories
# within the variable
levels(animals$Species)

## [1] "Amphibian" "Bird" "Crustacean" "Fish" "Insect"
## [6] "Mammal" "Reptile"


2. Subset the animals dataset based on the Species variable. We want to save all the observations where the species is indicated as “Bird”.

birds <- subset(animals, Species == "Bird")
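The same selection can also be written with logical indexing inside square brackets. The sketch below compares the two forms on a hypothetical data frame `toy`; the animal names and speeds are made up for illustration.

```r
# Hypothetical toy data frame
toy <- data.frame(
  Animal  = c("Ostrich", "Lion", "Eagle", "Crocodile"),
  Species = factor(c("Bird", "Mammal", "Bird", "Reptile")),
  Speed   = c(70, 80, 120, 35)
)

# Form used in the slides: subset() with a condition on a column
birds_a <- subset(toy, Species == "Bird")

# Alternative: a logical vector picks the rows; the empty slot after
# the comma keeps all columns
birds_b <- toy[toy$Species == "Bird", ]

birds_a$Animal   # "Ostrich" "Eagle"
birds_b$Animal   # "Ostrich" "Eagle"
nrow(birds_a)    # 2
```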



Association vs. Causation

Association: A relationship exists between two variables.
▶ Example: Ice cream sales and drowning incidents both rise in hot weather, but neither causes the other.
Causation: One variable directly affects the other.
▶ Example: Taking antibiotics reduces bacterial infections.

Correlation does not imply causation!

Why Does This Matter?


Mistaking association for causation can lead to false conclusions.
Exploratory Data Analysis (EDA) finds patterns, not proof.
Even if two variables move together, it doesn’t mean one causes the
other.
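The ice cream example can be illustrated with a small simulation. Everything in this sketch is an assumption made for illustration: the data are simulated, the variable names are hypothetical, and temperature is built in as the only driver of both quantities.

```r
set.seed(1)
n <- 200

# Simulated daily temperature drives both variables;
# neither variable is used to generate the other
temperature <- runif(n, min = 10, max = 35)
ice_cream   <- 20 * temperature + rnorm(n, sd = 50)
drownings   <- 0.3 * temperature + rnorm(n, sd = 1)

# The two variables are strongly associated ...
cor(ice_cream, drownings)

# ... but after removing the part of each that temperature explains,
# the remaining association is close to zero
ice_resid   <- resid(lm(ice_cream ~ temperature))
drown_resid <- resid(lm(drownings ~ temperature))
cor(ice_resid, drown_resid)
```

The strong first correlation appears even though the simulation contains no causal link between ice cream sales and drownings, which is exactly the trap the slide warns about.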



Example: county dataset
Scatterplots are one type of graph used to study the relationship between two numerical variables.
When two variables show some connection with one another, they are called associated variables.
If two variables are not associated, they are said to be independent, meaning there is no visible relationship.
[Figure: two scatterplots from the county dataset. Left panel, "Negative association": homeownership rate against percent of housing units that are multi-unit structures. Right panel, "Positive association": population change over 7 years against median household income.]
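Scatterplots of this kind can be drawn with the base R function `plot()`. The sketch below uses simulated data, not the county dataset, and all variable names are hypothetical; it shows a negative association, a positive association, and a pair of variables with no clear association.

```r
set.seed(2)
x <- runif(100, min = 0, max = 100)

y_neg <- 80 - 0.5 * x + rnorm(100, sd = 5)   # tends to fall as x rises
y_pos <- 10 + 0.5 * x + rnorm(100, sd = 5)   # tends to rise as x rises
y_ind <- rnorm(100, mean = 40, sd = 5)       # generated without using x

# Three panels side by side
par(mfrow = c(1, 3))
plot(x, y_neg, main = "Negative association")
plot(x, y_pos, main = "Positive association")
plot(x, y_ind, main = "No clear association")

# The sign of the correlation matches the direction seen in each plot
cor(x, y_neg)
cor(x, y_pos)
cor(x, y_ind)
```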



Explanatory and response variables

When we ask questions about the relationship between two variables, we sometimes also want to determine if the change in one variable causes a change in the other.

If there is an increase in the median household income in a county, does this drive an increase in its population?

In this question, we are asking whether one variable affects another:

Explanatory variable: Median household income
Response variable: Population change


When we suspect one variable might causally affect another, we label the first variable the explanatory variable and the second the response variable.
We also use the terms explanatory and response to describe variables where the response might be predicted using the explanatory variable, even if there is no causal relationship.

Explanatory variable -> might affect -> response variable
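In R, the formula notation `response ~ explanatory` records which variable plays which role. The sketch below is an illustration on simulated data only: `county_toy`, `income` and `pop_change` are hypothetical names, the numbers are not from the county dataset, and fitting a straight line with `lm()` is one possible way to predict the response, not a method taken from these slides.

```r
set.seed(3)

# Simulated values: income as explanatory, population change as response
income     <- runif(50, min = 20000, max = 120000)
pop_change <- -5 + 0.0001 * income + rnorm(50, sd = 3)
county_toy <- data.frame(income, pop_change)

# By convention the explanatory variable goes on the horizontal axis
plot(pop_change ~ income, data = county_toy)

# Predict the response from the explanatory variable
fit <- lm(pop_change ~ income, data = county_toy)
coef(fit)
predicted <- predict(fit, newdata = data.frame(income = 60000))
predicted
```

A fitted line like this describes prediction only; on its own it says nothing about whether income causes population change.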
