Chapter 1: Hello Data!

Study Theme 1: Introduction to Data & Exploratory Data Analysis

Dr Rene Stander

STC 122

2025

Dr Rene Stander (STC 122) Chapter 1: Hello Data! 2025 1 / 23

Super Animals
Let’s explore a dataset on animals across different ecosystems.

Import the dataset

animals <- read.csv("super-animals.csv")
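read.csv() looks for the file in R's working directory, so a wrong path is a common cause of a "cannot open file" error. The sketch below is only an illustration and does not use the course file: it writes a tiny hypothetical CSV (the file name demo-animals.csv is made up; the two rows reuse animal names and speeds shown later in the chapter) to a temporary folder and reads it back in the same way.

```r
# Hypothetical two-row file, written to a temporary folder so the example runs anywhere
demo_path <- file.path(tempdir(), "demo-animals.csv")
writeLines(c("Animal,Speed",
             "Secretarybird,65",
             "White headed vulture,48"), demo_path)

file.exists(demo_path)        # check that the path is right before importing
demo <- read.csv(demo_path)   # the first line of the file becomes the column names
head(demo)                    # preview the first rows
```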


Structure of data
Data is a collection of information.
Typically organised in rows and columns (a data frame or table).

Each row: an observation or case.

Each column: a variable (a characteristic or attribute)
Tidy data: each row is a unique case (observational unit), each
column is a variable, and each cell is a single value.
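As a small illustration of this layout, the sketch below builds a two-row data frame by hand. The values are copied from the first two rows of the Super Animals data as printed later in the chapter; the name mini is made up for this example.

```r
# Two observations copied from the first rows of the Super Animals data
mini <- data.frame(
  Animal  = c("White headed vulture", "Secretarybird"),
  Species = c("Bird", "Bird"),
  Age     = c(20, 12),
  Weight  = c(4.7, 4),
  Speed   = c(48, 65)
)

mini               # each row is one animal, each column is one variable
mini[2, ]          # one observation (row): the Secretarybird
mini$Speed         # one variable (column): the speed of every animal
mini[2, "Speed"]   # one cell: a single value
```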


Data in R

Dimensions of the dataset

dim(animals)

## [1] 108 10

This dataset has 108 observations (rows) and 10 variables (columns)

Extract the Speed variable and save it into a variable called animal_speed

animal_speed <- animals$Speed

How will you only get the number of rows? And only the number of
columns?
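One possible answer, sketched on a hypothetical 3 x 2 data frame (demo is made up so that the example runs without the course file). Applied to animals, the same calls should give 108 and 10.

```r
# Hypothetical 3 x 2 data frame, used only so the example is self-contained
demo <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c"))

nrow(demo)     # number of rows (observations)
ncol(demo)     # number of columns (variables)
dim(demo)[1]   # same as nrow(): the first element of dim()
dim(demo)[2]   # same as ncol(): the second element of dim()
```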


Types of variables

Variables can either be numerical (quantitative) or categorical
(qualitative).


Types of variables (Examples)
Numerical: Discrete

Variable              Example values
Number of pets        1, 2, 0
Goals scored          0, 1, 4, 7
Number of children    0, 1, 2, 3

Numerical: Continuous

Variable              Example values
Height                172.5, 160.0, 185.3
Temperature           23.4, 18.0, 31.2
Blood pressure        120.5, 130.2, 110.8


Types of variables (Examples)
Categorical: Nominal

Variable               Example values
Gender                 Male, Female
Blood type             A, B, AB, O
Eye colour             Blue, Green, Brown

Categorical: Ordinal

Variable               Example values
Education level        High school < Bachelor’s < Master’s
Satisfaction rating    Very dissatisfied < Neutral < Very satisfied
Pain severity          Mild < Moderate < Severe


Variables in the Super Animals dataset

Preview variables in the dataset

str(animals)

## 'data.frame': 108 obs. of 10 variables:
##  $ Number       : int 1 2 3 4 5 6 7 8 9 10 ...
##  $ Animal       : chr "White headed vulture" "Secretarybird" "Ku
##  $ Category     : chr "Savanna" "Savanna" "Savanna" "Savanna" ..
##  $ Species      : chr "Bird" "Bird" "Mammal" "Mammal" ...
##  $ Age          : num 20 12 15 20 25 20 50 25 3 15 ...
##  $ Weight       : num 4.7 4 227 250 828 54 1000 190 0.02 0.1 ...
##  $ Size         : num 85 152 240 240 505 162 380 250 3 38 ...
##  $ Speed        : num 48 65 70 80 56 80 55 80 30 54 ...
##  $ Vulnerability: int 1 2 4 4 4 4 1 2 4 4 ...
##  $ Updated      : chr "No" "Yes" "Yes" "No" ...


Variables in the Super Animals dataset

Get a list of the variable names

names(animals)

## [1] "Number"        "Animal"        "Category"      "Species"
## [5] "Age"           "Weight"        "Size"          "Speed"
## [9] "Vulnerability" "Updated"


Variables in the Super Animals dataset

Let’s classify each variable in the dataset based on its type:

Category: Categorical Nominal
Species: Categorical Nominal
Age: Numerical Continuous
Weight: Numerical Continuous
Size: Numerical Continuous
Speed: Numerical Continuous
Vulnerability: Categorical Ordinal


Variable types in R

In R, variables (columns) can have different types. Common types include:

numeric: Real numbers (e.g., 3.14, 42)
integer: Whole numbers (e.g., 1L, 5L)
character: Text or strings
factor: Categorical variables with levels
logical: TRUE or FALSE
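A minimal sketch of these five types, using made-up values (the names x_num, x_int and so on are hypothetical) and class() to ask R how it interprets each one:

```r
x_num <- 3.14                                  # numeric
x_int <- 5L                                    # integer: the L suffix asks for a whole number
x_chr <- "Savanna"                             # character
x_fac <- factor(c("Bird", "Mammal", "Bird"))   # factor with 2 levels
x_lgl <- TRUE                                  # logical

class(x_num)   # "numeric"
class(x_int)   # "integer"
class(x_chr)   # "character"
class(x_fac)   # "factor"
class(x_lgl)   # "logical"
class(42)      # "numeric": a whole number typed without L is still numeric
```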


Variable types in R

class(): Describes how R interprets the object (e.g., “data.frame”,
“factor”, “numeric”)

class(animals$Weight)

## [1] "numeric"

typeof(): Describes the internal storage type (e.g., “double”,
“integer”, “character”)

typeof(animals$Weight)

## [1] "double"
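The two functions can disagree, and a factor is the clearest case. The sketch below uses a small made-up factor (f is a hypothetical name) to show that R treats it as a factor while storing it as integer codes that point into its levels.

```r
f <- factor(c("Bird", "Mammal", "Bird"))

class(f)        # "factor": how R treats the object
typeof(f)       # "integer": factors are stored as integer codes
as.integer(f)   # 1 2 1: the codes, which index into levels(f)
levels(f)       # "Bird" "Mammal"
```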


Changing the variable type in R

In the Super Animals dataset, Vulnerability is a categorical (ordinal)
variable, coded with the values 1, 2, 3 and 4.
Have a look at the first 10 entries of the Vulnerability variable (only
the values 1, 2 and 4 happen to appear among them):

animals$Vulnerability[1:10]

## [1] 1 2 4 4 4 4 1 2 4 4

R stores this variable as an integer, not as a categorical variable.

class(animals$Vulnerability)

## [1] "integer"


Changing the variable type in R

We need to change how R sees and handles this variable. Change the
variable to an ordered factor with the following levels:
1 = Endangered (EN)
2 = Vulnerable (VU)
3 = Near Threatened (NT)
4 = Least Concern (LC)

animals$Vulnerability <- factor(animals$Vulnerability,
                                levels = c(1, 2, 3, 4),
                                labels = c("EN", "VU", "NT", "LC"),
                                ordered = TRUE)


Changing the variable type in R

class(animals$Vulnerability)

## [1] "ordered" "factor"

levels(animals$Vulnerability)

## [1] "EN" "VU" "NT" "LC"
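To see what the ordering adds, the sketch below applies the same factor() call to the first 10 Vulnerability codes printed earlier, so that it runs without the data file. The names codes and vuln are made up for this example.

```r
# First 10 Vulnerability codes, as printed earlier in the chapter
codes <- c(1, 2, 4, 4, 4, 4, 1, 2, 4, 4)
vuln <- factor(codes,
               levels = c(1, 2, 3, 4),
               labels = c("EN", "VU", "NT", "LC"),
               ordered = TRUE)

table(vuln)    # counts per level; NT is listed with a count of 0
vuln < "NT"    # TRUE for EN and VU, which rank below NT
max(vuln)      # the highest level present: LC
```

Comparisons such as < and max() only work because the factor is ordered; on an unordered factor R returns NA with a warning or an error instead.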


Changing the variable type in R

The variables Category and Species are also categorical variables.
Both are nominal, so we change the variable type to an (unordered) factor.

animals$Category <- factor(animals$Category)
animals$Species <- factor(animals$Species)
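A short sketch of what this conversion changes, using the first five Species values from the dataset so that it runs without the file (species_chr and species_fac are made-up names):

```r
# First 5 Species values from the Super Animals data
species_chr <- c("Bird", "Bird", "Mammal", "Mammal", "Mammal")
species_fac <- factor(species_chr)

levels(species_fac)    # "Bird" "Mammal": sorted alphabetically by default
nlevels(species_fac)   # 2
summary(species_chr)   # for text: only length, class and mode
summary(species_fac)   # for a factor: a count per category
```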


str(animals)

## 'data.frame': 108 obs. of 10 variables:
##  $ Number       : int 1 2 3 4 5 6 7 8 9 10 ...
##  $ Animal       : chr "White headed vulture" "Secretarybird" "Ku
##  $ Category     : Factor w/ 9 levels "Coast","Forest",..: 6 6 6 6
##  $ Species      : Factor w/ 7 levels "Amphibian","Bird",..: 2 2 6
##  $ Age          : num 20 12 15 20 25 20 50 25 3 15 ...
##  $ Weight       : num 4.7 4 227 250 828 54 1000 190 0.02 0.1 ...
##  $ Size         : num 85 152 240 240 505 162 380 250 3 38 ...
##  $ Speed        : num 48 65 70 80 56 80 55 80 30 54 ...
##  $ Vulnerability: Ord.factor w/ 4 levels "EN"<"VU"<"NT"<..: 1 2 4
##  $ Updated      : chr "No" "Yes" "Yes" "No" ...


Subset data based on characteristics

Extract the observations of the birds from the data.

1. The type of species is indicated in the Species variable. First, have a
look at what is included in this variable.

# Extract the first 5 values
animals$Species[1:5]

## [1] Bird   Bird   Mammal Mammal Mammal
## Levels: Amphibian Bird Crustacean Fish Insect Mammal Reptile

# Display the different categories within the variable
levels(animals$Species)

## [1] "Amphibian"  "Bird"       "Crustacean" "Fish"       "Insect"
## [6] "Mammal"     "Reptile"


Subset data based on characteristics

2. Subset the animals dataset based on the Species variable. We want
to save all the observations where the species is indicated as “Bird”.

birds <- subset(animals, Species == "Bird")
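The same subset can also be written with square brackets. The sketch below compares the two forms on a hypothetical stand-in called toy (its five rows reuse the first Species and Speed values from the dataset) and shows one thing to watch for: the factor keeps its unused levels after subsetting unless they are dropped.

```r
# Hypothetical stand-in for animals, built from the first five Species and Speed values
toy <- data.frame(
  Species = factor(c("Bird", "Bird", "Mammal", "Mammal", "Mammal")),
  Speed   = c(48, 65, 70, 80, 56)
)

birds_a <- subset(toy, Species == "Bird")   # subset() form, as used above
birds_b <- toy[toy$Species == "Bird", ]     # bracket form: chosen rows, all columns
all.equal(birds_a, birds_b)                 # check that both forms select the same rows

levels(birds_a$Species)          # still lists "Mammal", although no mammals remain
birds_a <- droplevels(birds_a)   # drop factor levels that are no longer used
levels(birds_a$Species)          # "Bird"
```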


Association vs. Causation

Association: A relationship exists between two variables.

▶ Example: Ice cream sales and drowning incidents. Both rise in hot
weather, but neither causes the other.
Causation: One variable directly affects the other.
▶ Example: Taking antibiotics reduces bacterial infections.

Correlation does not imply causation!

Why Does This Matter?

Mistaking association for causation can lead to false conclusions.
Exploratory Data Analysis (EDA) finds patterns, not proof.
Even if two variables move together, it doesn’t mean one causes the
other.
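The ice cream example can be imitated with simulated data. In the sketch below every number is made up: a hypothetical temperature variable drives both ice cream sales and drownings, and neither of those two appears in the other's formula. lm() and resid(), which are not covered in this chapter, are used only to remove the part of each variable that temperature explains.

```r
set.seed(1)                                      # make the simulated numbers reproducible
temperature <- runif(200, min = 10, max = 35)    # hypothetical daily temperature
ice_cream <- 20 + 3 * temperature + rnorm(200, sd = 5)     # depends on temperature only
drownings <- 1 + 0.2 * temperature + rnorm(200, sd = 1)    # depends on temperature only

# The two variables are associated, although neither causes the other
r_raw <- cor(ice_cream, drownings)
r_raw

# Remove what temperature explains from each variable, then correlate what is left
r_adj <- cor(resid(lm(ice_cream ~ temperature)),
             resid(lm(drownings ~ temperature)))
r_adj   # much closer to 0 than r_raw
```

The association between the two variables comes from the shared third variable, which is why a pattern found in exploratory analysis is not proof of causation.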


Example: county dataset
Scatterplots are one type of graph used to study the relationship
between two numerical variables.
When two variables show some connection with one another, they are
called associated variables.
If two variables are not associated, they are said to be independent,
meaning there is no visible relationship.
[Figure: two scatterplots from the county dataset.
Negative association: homeownership rate against the percent of housing
units that are multi-unit structures.
Positive association: population change over 7 years against median
household income.]
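The county data is not included here, so the sketch below uses mtcars, a dataset that ships with R, as a stand-in to draw one scatterplot of each kind. cor() is added as a numerical check of the direction; the names r_neg and r_pos are made up for this example.

```r
# mtcars ships with R; it stands in for the county data, which is not included here
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "Negative association")
plot(mtcars$wt, mtcars$disp,
     xlab = "Weight (1000 lbs)", ylab = "Displacement (cu. in.)",
     main = "Positive association")

r_neg <- cor(mtcars$wt, mtcars$mpg)    # below 0: heavier cars tend to have lower mpg
r_pos <- cor(mtcars$wt, mtcars$disp)   # above 0: heavier cars tend to have larger engines
```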


Explanatory and response variables

When we ask questions about the relationship between two variables, we
sometimes also want to determine if the change in one variable causes a
change in the other.

If there is an increase in the median household income in a county, does this
drive an increase in its population?

In this question, we are asking whether one variable affects another:

Explanatory variable: Median household income
Response variable: Population change


Explanatory and response variables

When we suspect one variable might causally affect another, we label the
first variable the explanatory variable and the second the response variable.
We also use the terms explanatory and response to describe variables where
the response might be predicted using the explanatory, even if there is no
causal relationship.

Explanatory variable -> might affect -> response variable
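In R's formula notation the response goes on the left of ~ and the explanatory variable on the right. The sketch below is only an illustration: it uses lm(), which this chapter does not introduce, on the built-in mtcars data as a stand-in, and treats mpg as the response and wt as the explanatory variable.

```r
# mtcars ships with R and is used here only as a stand-in dataset
# Formula syntax: response ~ explanatory
fit <- lm(mpg ~ wt, data = mtcars)   # response: mpg, explanatory: wt
coef(fit)                            # intercept and slope of the fitted line

# Predict the response for a hypothetical car with wt = 3 (3000 lbs)
pred <- predict(fit, newdata = data.frame(wt = 3))
pred
```

The prediction uses the association in the data; on its own it says nothing about whether weight causes the change in fuel consumption.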
