DSC2608 Learning_Unit_1
Programming in R
Learning objectives and outcomes: When you reach the end of learning unit 1, you should be able to
do the following:
1. Install R and implement code using the R programming language.
2. Demonstrate the use of expressions, assigning of variables, and operations in R.
3. Demonstrate the use of data types such as numbers and strings, including logical comparisons.
4. Demonstrate the use of sequences such as arrays and ranges.
5. Demonstrate the use of tables with row and/or column manipulation.
1.1.1 What is R and why is it useful in economics. R is a software environment and programming
language used for statistical computing and graphics. It is accessible to researchers, practitioners,
and students equally because it is open source and publicly available. Owing to its adaptability,
functionality, and extensibility, R is widely used in a variety of disciplines, including economics.
R is useful in economics for several reasons. Here are some of the reasons:
• It offers a broad range of statistical and econometric approaches, including linear and nonlinear
modelling, time-series analysis, panel data analysis, and machine learning algorithms. These
methods give economists the ability to analyse economic relationships, test hypotheses, and make
predictions by analysing data.
• Thanks to an applied quantitative modelling curriculum, you can perform empirical research and
analyse economic data using the R language. By doing this, you can develop economic theory
and contribute knowledge for deciding on policies that have an effect on the real world.
• Given that many organisations increasingly demand data analytic competencies, having experience
with R can be advantageous in the job market. R coding is therefore very relevant, especially in
data-driven professions like economics. It is a transferrable skill that can be applied to many other
fields, including finance and banking, healthcare, and marketing.
Visit the following page for more details on why R is crucial in economics and why you should choose
it: https://www.core-econ.org/why-doing-economics-has-embraced-r/
In summary, students need to study R coding because it is a widely used language in the field of
econometrics and data analysis. R provides a powerful set of tools for data manipulation, visualisation,
and statistical analysis, and is widely used by researchers and practitioners in many fields. Learning R
can also help students to develop critical thinking skills and improve their ability to communicate their
findings.
By the end of this module, you will be able to create various types of data visualisations, such as
graphs, charts and other illustrations that help you to understand and communicate patterns and
relationships in the data. These include linear regression plots and classification tree diagrams, as
Section 1.1. Getting started with R Page 2
demonstrated by the examples in the following figures that illustrate boxplots, scatter plots, density
plots, histograms, etc.
Figure 1.1: Bar chart of the number of customers who have defaulted on credit card payments versus
those who have not
Figure 1.3: Scatter plot of the relationship between income and balance of customers’ credit cards
Figure 1.4: Boxplot of credit card balance for customers who defaulted versus those who have not
Figure 1.5: Density plot of credit card balance for customers who defaulted versus those who did not
Figure 1.6: Scatter plots of Sales against TV, Radio and Newspaper advertisement budgets
Figure 1.7: Classification tree of default against income, balance, and student predictors
1.1.2 Finding and installing R and RStudio. The R Core Team maintains a network of servers that
contains installation files and documentation on R, called the Comprehensive R Archive Network, or
CRAN.
You can access it at: http://cran.r-project.org/ or https://cran.rstudio.com/ or a Google search for
CRAN R. See the R FAQ (r-project.org) for general information about R and the R for Windows FAQ
(rstudio.com) for Windows-specific information. R is available for Windows, Mac and Unix-like operating
systems. Installation files and instructions can be downloaded from the CRAN site. Note the following:
• Download the version compatible with your operating system (OS).
• R needs to be installed before RStudio is installed.
• RStudio facilitates communication between the user and the computer. To interact with R, RStudio
offers a user-friendly interface and a number of tools. It enables efficient writing, execution,
and management of R code. However, R is the actual programming language. Think of RStudio
as a tool that enables you to interact with the computer using R.
1.1.3 Getting started with RStudio. To get started, open RStudio just as you would open any other
application on the computer. The landing page shown in Figure 1.8 will appear on the screen. It usually
has four screens or panes, each of which serves a specific purpose.
RStudio has keyboard shortcuts for running all or some of the code in a script. The following are some
of the most useful shortcuts:
• Ctrl + Enter: Run current line or selection.
• Ctrl + Shift + Enter: Run all lines.
• Ctrl + Alt + B: Run from the beginning to the current line.
• Ctrl + Alt + E: Run from the current line to the end.
It is not necessary to memorise the shortcuts and commands as you can always refer to the cheatsheet
at this link: https://www.rstudio.com/resources/cheatsheets/. It gives you a quick way of accessing
some of the commands and syntax, which can always guide you through the most useful features of
RStudio, as well as the long list of keyboard shortcuts built into RStudio.
1.1.4 R Commands, assignment and objects. In order to use R, you need to learn the R language
and really not much more. The R language is a mix of functional and object-oriented styles.
In R, the instructions you provide are referred to as commands. These commands typically do not
require semicolons to indicate the end of a statement. Instead, they are usually terminated by starting
a new line. The assignment operator <- is used to assign a value. The variables you create in R
are called objects.
Note that commands with an assignment update objects in your R environment. Commands without an
assignment, as well as calls to R objects, generate output either in the console or in the plot pane.
Arithmetic
+        addition
-        subtraction
*        multiplication
/        division
^        exponentiation
%/%      integer division
%%       modulo (remainder)
Relational
a == b   Is a equal to b? (Do not confuse with =.)
a != b   Is a not equal to b?
a < b    Is a less than b?
a > b    Is a greater than b?
a <= b   Is a less than or equal to b?
a >= b   Is a greater than or equal to b?
Logic
!        not
&        and
|        or
&&       sequential and
||       sequential or
isTRUE   check whether the logical value is true
Indexing
$        part of a data frame or list
[ ]      part of a data frame, array or list
[[ ]]    part of a list
@        part of an S4 object
> pi # built-in mathematical constant
[1] 3.141593
> sqrt(25) # square root
[1] 5
> log(1) # natural logarithm
[1] 0
> log(1, base = 10) # logarithm to base 10
[1] 0
> exp(0) # exponential function: e raised to the power 0
[1] 1
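The arithmetic and relational operators listed above can be tried directly in the console. The following is a minimal sketch; the values 17 and 5 and the object names are arbitrary choices made for illustration and are not part of the module data.

```r
# Illustrative values (arbitrary assumptions, not module data)
a <- 17
b <- 5

quotient  <- a %/% b   # integer division: how many whole times b fits into a
remainder <- a %% b    # modulo: what is left over after integer division
power     <- b^2       # exponentiation

# Together the two results reconstruct the original number
a == quotient * b + remainder

# Relational operators return logical values
a > b
a != b
```

Each comparison prints either TRUE or FALSE, which is the logical data type discussed later in this learning unit.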
To find out whether an object, such as a function or a constant, exists in R, you can use the exists()
function. It will either return TRUE or FALSE. For instance, the built-in pi object holds the value of
the mathematical constant π, which is approximately 3.141593.
> exists("pi")
[1] TRUE
1.2.2 Variable assignment and operation in R. In R, a variable is a fundamental element that enables
you to give a specific name to a particular datum and store it together with other similar data. For
example, you can store values such as a date, the number 6 or the text Hello under variable names. By doing this, you can
retrieve the stored data by calling the variable name. In programming, an identifier is a unique name
that you can assign to a variable, function or object to help you distinguish it from others.
The operator <- or = is used for variable assignment. To see what is contained in a variable,
type the name and R will print the content.
The following R command creates an object named “a” and assigns the value 2022 to it. If “a” had
previously been created in the script, the original value would be overwritten. This means that objects
can be created and their data can be changed using the assignment operator. R has a case-sensitive
syntax. The variables “a” and “A” can coexist and have different values in the R environment.
> # variable assignment
> a <- 2022
> a
[1] 2022
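The points about overwriting and case sensitivity can be illustrated with a short sketch. The values 2023, "Hello" and 10, and the object names A and b, are assumptions made only for this example.

```r
# Illustrative values (assumptions for this sketch)
a <- 2022      # create the object a
a <- 2023      # assigning again overwrites the previous value
A <- "Hello"   # A is a different object because R is case-sensitive
b = 10         # = also assigns, although <- is the more common style

a
A
b
```

After running these lines, a holds 2023 rather than 2022, while A and a coexist with different values.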
This object referred to as “a” is stored in your workspace. You can always see what is stored in the
workspace by using the ls() function:
> ls()
[1] "a"
1.3.1 Basic data types in R. R works with numerous data types that can be used to store different
kinds of data. The following are the most common types of data:
• Numeric. This data type is used to store numbers, including integers and decimal numbers.
Examples of numeric data in R include the following:
> # numeric
> x <- 3.5
> x
[1] 3.5
> class(x)
[1] "numeric"
> # a whole number is also stored as numeric by default
> y <- 5
> y
[1] 5
> class(y)
[1] "numeric"
> z <- x + y
> z
[1] 8.5
> class(z)
[1] "numeric"
• Character. This data type is used to store text (or string) values. To display output or results,
you can either use the print() function or just the variable name. Examples of character data
in R include the following:
> # character data
> module_name <- "Welcome to DSC2608 - Applied Quantitative Modelling"
> print(module_name)
[1] "Welcome to DSC2608 - Applied Quantitative Modelling"
> class(module_name)
[1] "character"
• Logical. This data type is used to store Boolean values, which can be either TRUE or FALSE. In
the following example, is_recession is a logical variable that is set to FALSE. This variable could
be used to represent whether or not the economy is currently in a recession.
> # logical (or Boolean)
> is_recession <- FALSE
> class(is_recession)
[1] "logical"
• Factor. This data type is used to hold categorical data where each value corresponds to a certain
category. The following are some examples of factor data in R:
> credit_rating <- c("AAA", "BBB", "BB", "B", "CCC")
> rating <- factor(c("BBB", "AAA", "BB", "B", "CCC"),
+ levels = credit_rating)
> credit_rating
[1] "AAA" "BBB" "BB" "B" "CCC"
> rating
[1] BBB AAA BB  B   CCC
Levels: AAA BBB BB B CCC
Section 1.3. Data types and data structures Page 11
A factor variable called rating is created using the factor() function. The levels argument
defines the possible values of the factor, in this case the credit ratings.
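The data types above can be explored a little further with the following sketch. It assumes hypothetical object names (y_numeric, y_integer, is_large, ratings) and hypothetical ratings data. It shows that a whole number is only stored as an integer when the L suffix is used, that a comparison yields a logical value, and how the levels of a factor can be inspected.

```r
# Numeric versus integer: the L suffix requests integer storage
y_numeric <- 5
y_integer <- 5L
class(y_numeric)        # "numeric"
class(y_integer)        # "integer"
is.integer(y_numeric)   # FALSE

# A logical comparison produces a logical value
is_large <- y_numeric > 3
class(is_large)         # "logical"

# Factor: inspect the levels and count the observations per level
# (the ratings below are hypothetical illustrative data)
rating_levels <- c("AAA", "BBB", "BB", "B", "CCC")
ratings <- factor(c("BBB", "AAA", "BBB", "B"), levels = rating_levels)
levels(ratings)
table(ratings)          # counts per level, including levels with no observations
as.integer(ratings)     # underlying codes: position of each value in the levels
```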
1.3.2 Data structures. In order to store and manipulate data, R provides a wide range of data structures.
The following are a few common R data structures:
• Vectors. To assign multiple values to a variable, we can use an R object called a vector. A
vector is a sequence/collection of data elements of the same data type such as numeric data and
characters. Members in a vector are called components. The c() function in R concatenates its
arguments to build a vector. Here are some examples of
vectors in R:
> # create a numeric vector of stock prices: daily closing prices
> # over a certain period of time for the S&P 500 index
> stock_prices <- c(4000, 4015, 4030, 4025, 4010, 4025, 4040, 4055, 4045,
+ 4060, 4080, 4090)
> class(stock_prices)
[1] "numeric"
To access a specific element in the vector, you can use square brackets and specify the position
of the element you want to retrieve. For example, if you have a vector named gdp_growth and
you want to access the third element in the vector, you would use gdp_growth[3]:
> gdp_growth[3]
[1] 0.6
• Matrices. A matrix is a sequence/collection of data elements of the same data type arranged in
a two-dimensional rectangular layout with rows and columns.
The matrix() function in R can be used to construct a matrix. It accepts a vector of data and
the dimensions of the matrix as inputs or arguments.
> # create a 3x4 matrix from the stock_prices vector
> stock.price_matrix <- matrix(stock_prices, nrow = 3, ncol = 4, byrow = TRUE)
> # print the resulting matrix
> stock.price_matrix
     [,1] [,2] [,3] [,4]
[1,] 4000 4015 4030 4025
[2,] 4010 4025 4040 4055
[3,] 4045 4060 4080 4090
Note that the nrow argument specifies the number of rows, the ncol argument specifies the
number of columns and the byrow argument specifies that the matrix should be filled row by row
(as opposed to column by column). Additionally, to access a specific element in the matrix, you
use variable_name[i, j], where i is the row index and j is the column index. To access an entire
row, you use variable_name[i, ], where i is the row index. To access an entire column, you use
variable_name[, j], where j is the column index.
For example, if you use the stock.price_matrix and you want to access the element in the second
row and third column, you would use stock.price_matrix[2, 3], as follows:
> # access the element in the second row and third column
> stock.price_matrix[2, 3]
[1] 4040
• Data frames. A data frame is used for storing data tables. It is a list of vectors of equal length.
Unlike matrices, it can gather vectors containing different variable types. You can create a data
frame in R using the data.frame() function.
> # create a data frame df with 3 columns: price, currency and logical.operator
> df <- data.frame(price = c(2000, 109.26, 139866.50),
+ currency = c("ZAR", "USD", "EUR"),
+ logical.operator = c(TRUE, TRUE, FALSE))
The selection of specific elements in data frames works the same way as for matrices. The
command df[i, j] returns the element in the ith row and the jth column. For example, here
is how to retrieve the element in the third row and the first column of the df data frame:
> # retrieve third row and first column
> df[3, 1]
[1] 139866.5
• Lists. A list in R allows you to gather a variety of objects under one name in an ordered way.
These objects can be matrices, vectors, data frames or even other lists. It is NOT required that
these objects must be related to each other.
The following example creates a list called company_info that contains four elements with different
data types:
> company_info <- list(
+ company_name = "XYZ Holdings Ltd",
+ share_price = 219.34,
+ num_employees = 152000,
+ is_public = TRUE)
The R code creates a list named company_info that contains the details of a company called XYZ
Holdings Ltd. Included in the data are the company name, share price, total
number of employees and whether or not the company is traded publicly. Generally, lists can be
useful for storing and organising information of different data types that could for instance be
used to perform various analyses on individual companies.
To access the ith object in the list, write list_name[[i]]. To access an object by its name
instead, use the $ operator followed by the name of the object. This is useful if you
do not know the index of the item in the list.
For instance, use company_info$share_price to retrieve the share price variable in the company_info
list.
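The access rules for the four data structures above can be compared side by side. The following is a minimal sketch that uses shortened versions of the examples in this section; the object name m and the reduced number of values are assumptions made to keep the example small.

```r
# Vector: positions start at 1 (first four values of the stock_prices example)
stock_prices <- c(4000, 4015, 4030, 4025)
stock_prices[3]        # third element
stock_prices[c(1, 4)]  # first and fourth elements

# Matrix: whole rows and whole columns
m <- matrix(stock_prices, nrow = 2, ncol = 2, byrow = TRUE)
m[1, ]                 # first row
m[, 2]                 # second column

# Data frame: columns can also be selected by name
df <- data.frame(price = c(2000, 109.26),
                 currency = c("ZAR", "USD"))
df$currency            # whole column as a vector
df[2, "price"]         # row 2 of the price column

# List: by position with [[ ]] or by name with $ or [[ ]]
company_info <- list(company_name = "XYZ Holdings Ltd",
                     share_price = 219.34)
company_info[[1]]
company_info$share_price
company_info[["share_price"]]
```

Selecting by name is usually easier to read than selecting by position, and it keeps working if the order of the columns or list elements changes.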
Section 1.4. Sequences Page 13
1.4 Sequences
1.4.1 Basic sequences. An important way of creating vectors is to generate a sequence of numbers. The
colon operator : is used to generate integer sequences. It can also be combined with other operations
to build more complex sequences. For creating and iterating through value ranges, it is a very helpful
operator in R.
For instance, 1:10 will generate a sequence of numbers starting from 1 to 10 by steps of 1. This means
there is a need to specify the first and the last values separated by a colon.
> # create a sequence of numbers from 1 to 10
> 1:10
 [1]  1  2  3  4  5  6  7  8  9 10
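As noted above, the colon operator can be combined with other operations. The following sketch shows a few such combinations; the object names evens, countdown and shifted are assumptions made for this example.

```r
evens     <- 2 * (1:5)     # arithmetic applies to every element
countdown <- 10:1          # a larger first value gives a descending sequence
shifted   <- (1:5) + 0.5   # add a constant to every element

evens
countdown
shifted

# Operator precedence: the colon is evaluated before multiplication
1:5 * 2      # same as (1:5) * 2
1:(5 * 2)    # brackets are needed to obtain the sequence from 1 to 10
```

Because the colon is evaluated before arithmetic operators, it is safest to add brackets whenever an end point is itself the result of a calculation.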
More generally, any arithmetic progression can be generated by the function seq(). The parameters of
seq are shown in the following list:
• seq(from, to) specifies the first and the last values.
• seq(from, to, by = ) specifies the first and last values with the step size.
• seq(length.out = ) creates a sequence of the given length, starting from 1 in steps of 1.
• seq(from, to, length.out = ) creates an equally spaced sequence of the given length between the first and last values.
• seq(along.with = ) creates a sequence with the same length as another object.
The following illustrates each of these forms:
> seq(from = 1, to = 10)
 [1]  1  2  3  4  5  6  7  8  9 10
> # generates a sequence of numbers from 0 to 15 with a step of 3
> seq(from = 0, to = 15, by = 3)
[1]  0  3  6  9 12 15
> # generates the integers from 1 to 15 (from and by both default to 1)
> seq(length.out = 15)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
> # generates a sequence of 5 evenly spaced numbers between -5 and 5, inclusive
> seq(from = -5, to = 5, length.out = 5)
[1] -5.0 -2.5  0.0  2.5  5.0
> # generates a sequence of numbers from 0 to 10 with a default step of 1,
> # and assigns it to the variable s1
> s1 <- seq(0, 10)
> s1
 [1]  0  1  2  3  4  5  6  7  8  9 10
> # generates evenly spaced numbers from 1 to 5,
> # with the same length as the sequence stored in the variable s1
> s2 <- seq(1, 5, along.with = s1)
> s2
 [1] 1.0 1.4 1.8 2.2 2.6 3.0 3.4 3.8 4.2 4.6 5.0
Sometimes it is necessary to have repeated values. In such instances, the rep() function can be used.
It replicates the values it is given a specified number of times:
> # generates a vector of length 10, where each element is equal to 5
> rep(5, 10)
 [1] 5 5 5 5 5 5 5 5 5 5
> # each number from 5 to 10 is repeated twice
> rep(5:10, each = 2)
 [1]  5  5  6  6  7  7  8  8  9  9 10 10
> # length.out specifies the length of the output vector
> rep(5, 10, length.out = 10)
 [1] 5 5 5 5 5 5 5 5 5 5
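The difference between the times and each arguments of rep() is easiest to see with character values. The sketch below uses hypothetical labels "A" and "B" and a hypothetical object named module.

```r
# times repeats the whole vector; each repeats every element in turn
rep(c("A", "B"), times = 3)
rep(c("A", "B"), each = 3)

# a vector of times gives a different count for each element
module <- rep(c("Microeconomics", "Macroeconomics"), times = c(2, 3))
module
```

The last form is a compact way of building a label column for a data frame in which groups have different sizes.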
Within the Additional Resources folder on the module site, you can find an Exercise Manual that
includes exercises aimed at evaluating your understanding of each Learning Unit.
Complete Activity 1.1 in the Exercise Manual before you proceed to the next subsection.
1.4.2 Arrays. An array is a multidimensional R data object that can store data in more than two
dimensions. A matrix is the special case of a two-dimensional array. An array is created using the
array() function, which takes vectors as input and uses the values in the dim parameter to set the dimensions.
The following is an example of a 3 × 3 array in R created from a matrix of stock prices:
> # create a matrix with stock prices
> stock.prices_matrix <- matrix(c(100, 150, 200, 125, 175, 225, 150, 200, 250),
+ nrow = 3, ncol = 3, byrow = TRUE)
> # convert the stock.prices_matrix to a 3x3 array
> stock.prices_array <- array(stock.prices_matrix, dim = c(3, 3))
> # print the stock.prices_array
> print(stock.prices_array)
     [,1] [,2] [,3]
[1,]  100  150  200
[2,]  125  175  225
[3,]  150  200  250
Here is another example to help you understand better: Imagine you have information about the gross
domestic product (GDP) of three countries – X, Y and Z – for four years – 2019, 2020, 2021 and 2022.
This information is measured in billions of Rands. You can represent these data in an array using the
following code:
> # create an array for GDP data (12 values fill the 3x4 array column by column)
> gdp_array <- array(c(100, 120, 140, 150, 200, 220, 240, 260, 300,
+ 330, 350, 380),
+ dim = c(3, 4))
> # print the gdp_array
> print(gdp_array)
     [,1] [,2] [,3] [,4]
[1,]  100  150  240  330
[2,]  120  200  260  350
[3,]  140  220  300  380
Moreover, you can update the names of the rows and columns in gdp_array by using rownames(),
colnames() or dimnames(). This will help you to organise and label data in a clear and meaningful way.
Section 1.5. Tables Page 15
Just like we have seen before with vectors, lists, matrices, etc., array elements can also be accessed. For
instance, print the element in the second row and third column of gdp_array:
> # second row and third column element of gdp_array
> print(gdp_array[2, 3])
[1] 260
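The examples above are two-dimensional, but the same array() function also handles a third dimension. The sketch below uses a hypothetical array named econ_array filled with the numbers 1 to 12 purely to show how three indices work.

```r
# Hypothetical data: 2 rows x 3 columns x 2 layers, filled with 1 to 12
econ_array <- array(1:12, dim = c(2, 3, 2))
dim(econ_array)

# Three indices are needed: row, column and layer
econ_array[1, 2, 2]

# Leaving the first two indices empty returns a whole layer as a matrix
econ_array[, , 1]
```

Each layer behaves like an ordinary matrix, so the row and column indexing rules from the matrix section apply within a layer.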
Complete Activity 1.2 in the Exercise Manual before you proceed to the next section.
1.5 Tables
Tables are a fundamental object type for representing datasets. A table can be viewed in two ways:
1. as a sequence of named columns that each describe a single aspect of all entries in a dataset
2. as a sequence of rows that each contain all information about a single entry in a dataset
Tables are similar to arrays in that they can store multiple values. Table 1.2 presents some key differences
between tables and arrays:
Tables | Arrays
Tables have a fixed number of columns and each column has a defined data type or format. | Arrays can be multidimensional and can hold elements of any type, including other arrays.
Tables typically have named columns, which allows for easy reference to specific attributes of the data. | Arrays, on the other hand, usually have numerical indices to access elements.
Tables are more flexible when it comes to adding or removing columns, as the structure remains consistent across rows. | In arrays, adding or removing elements might require reshaping or resizing the entire array.
Another specific type of table commonly used in R for statistical and data analysis is called a data frame.
It is a two-dimensional table-like data structure where each column can contain values of different types
(e.g., characters, numerics, logical) and can have its own name. Data frames are particularly useful for
working with heterogeneous datasets, where different variables may have different data types.
The following Subsections 1.5.1 to 1.5.4 provide some useful table operations:
1.5.1 Binding. The binding of columns can be executed when two datasets, a dataset and a vector, or
two vectors have the same number of values (or the same number of rows in the case of datasets). They
can be placed together into one dataset using cbind() or bind_cols() (from the dplyr package). This is
different from merging (which is discussed later in the study guide) because there is no row matching
system. Similarly, for row binding, rbind() or bind_rows() can be used in R. Let us create two data
frames, df1 and df2, containing the student ID and module name, and perform some row and column
binding operations.
> # create a data frame with student IDs and the economics modules
> df1 <- data.frame(student_id = c(1:5),
+ module = c(rep("Microeconomics", 2), rep("Macroeconomics", 3)))
> # print the data frame to the console
> df1
  student_id         module
1          1 Microeconomics
2          2 Microeconomics
3          3 Macroeconomics
4          4 Macroeconomics
5          5 Macroeconomics
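Only df1 is shown in the listing above. A minimal sketch of how the binding operations could continue is given below; the contents of df2, the marks vector and the name all_students are hypothetical and chosen only for illustration.

```r
df1 <- data.frame(student_id = 1:5,
                  module = c(rep("Microeconomics", 2), rep("Macroeconomics", 3)))

# Hypothetical second data frame with the same columns as df1
df2 <- data.frame(student_id = 6:7,
                  module = c("Econometrics", "Econometrics"))

# Row binding stacks df2 underneath df1; the column names must match
all_students <- rbind(df1, df2)

# Column binding adds a vector with one value per row (hypothetical marks)
marks <- c(65, 72, 58, 80, 49, 77, 61)
all_students <- cbind(all_students, marks)
all_students
```

Note that cbind() simply places the values next to each other in the order given, so the vector must already be in the same row order as the data frame.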
1.5.2 Sorting. In R, it is possible to sort the rows of a dataset either in alphabetical order based on
a column containing character variables, or in numerical order based on a column containing numeric
variables. Additionally, the sorting can be done in either ascending or descending order using the sort()
function.
The following example shows how to create a vector of student IDs using the c() function and then
sort it. The vector contains ten elements, each of which is a unique student ID:
> # create a vector of student IDs
> student_id <- c(1011, 1005, 1013, 1017, 1009, 1001, 1003, 1015, 1007, 1019)
> student_id
 [1] 1011 1005 1013 1017 1009 1001 1003 1015 1007 1019
> # sort the student IDs in ascending order (from lowest to highest)
> sorted_ids_ascending <- sort(student_id)
> sorted_ids_ascending
 [1] 1001 1003 1005 1007 1009 1011 1013 1015 1017 1019
> # sort the student IDs in descending order (from highest to lowest)
> sorted_ids_descending <- sort(student_id, decreasing = TRUE)
> sorted_ids_descending
 [1] 1019 1017 1015 1013 1011 1009 1007 1005 1003 1001
Here is another example of how to create a data frame of student marks and sort them in descending
order. It further adds a result column to indicate whether the student passed or failed given that the
pass mark is greater than or equal to 50:
> # create a data frame of student marks
> marks <- data.frame(
+ student_name = c("Tlou", "Jane", "Piers", "Sarah", "Thato"),
+ DSC2608_marks = c(80, 75, 44, 39, 92)
+ )
> # print the student marks data frame
> print(marks)
  student_name DSC2608_marks
1         Tlou            80
2         Jane            75
3        Piers            44
4        Sarah            39
5        Thato            92
The - sign preceding marks$DSC2608_marks indicates descending order: it reverses the usual
ascending order so that the data are arranged from the highest to the lowest value.
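The sorting step that the minus sign refers to is not shown in the listing above. A minimal sketch of how it could be written with the base R order() function follows; the object name marks_sorted and the column name result are assumptions made for this example.

```r
marks <- data.frame(
  student_name = c("Tlou", "Jane", "Piers", "Sarah", "Thato"),
  DSC2608_marks = c(80, 75, 44, 39, 92)
)

# order() returns the row positions that sort the column; the minus sign
# reverses the ordering so that the highest mark comes first
marks_sorted <- marks[order(-marks$DSC2608_marks), ]

# add a result column: pass if the mark is at least 50 (column name assumed)
marks_sorted$result <- ifelse(marks_sorted$DSC2608_marks >= 50, "pass", "fail")
marks_sorted
```

The original row numbers are kept as row names, which is why a sorted data frame prints with row labels that are no longer in ascending order.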
The following R code uses the kable() function from the knitr package to create a formatted table
of the students' marks and their results in DSC2608. The knitr package is used to improve the
formatting of output to make it easier to read. Refer to Section 1.6 for an explanation of how to use
packages in R.
> library(knitr)
> # create a table using kable()
> kable(marks, align = c("l", "c", "c"),
+ col.names = c("Student Name", "DSC2608 Marks", "Result"))
|   | Student Name | DSC2608 Marks | Result |
|:--|:-------------|:-------------:|:------:|
|5  | Thato        |      92       |  pass  |
|1  | Tlou         |      80       |  pass  |
|2  | Jane         |      75       |  pass  |
|3  | Piers        |      44       |  fail  |
|4  | Sarah        |      39       |  fail  |
1.5.3 Transformation. New columns can be created or existing ones modified by applying
transformations to them. Transformations can be adding, subtracting, multiplying, dividing and raising
to a power. Functions such as log() and exp() can also be applied. For example, an additional column
pass_fail can be added to the existing marks data frame using the mutate() function from the dplyr package.
> library(dplyr)
> # additional transformation using mutate function from dplyr package
> marks <- marks %>%
+ mutate(pass_fail = ifelse(DSC2608_marks >= 50, "pass", "fail"))
> marks
  student_name DSC2608_marks pass_fail
1         Tlou            80      pass
2         Jane            75      pass
3        Piers            44      fail
4        Sarah            39      fail
5        Thato            92      pass
The code uses the mutate() function from the dplyr package to add a new column named
pass_fail to the data frame, which indicates whether a student passed or failed the DSC2608 module
based on DSC2608_marks. The ifelse() function assigns pass to the new column when the mark is
at least 50 and fail otherwise.
Using the same marks data frame, you can give the columns more descriptive names using the
rename() function from the dplyr package. The student_name column is renamed Student_Name,
the DSC2608_marks column is renamed DSC2608_Marks and the pass_fail column is renamed
Results.
> library(dplyr)
> # rename the columns in the marks data frame to more descriptive names
> marks <- marks %>%
+ rename(Student_Name = student_name, DSC2608_Marks = DSC2608_marks,
+ Results = pass_fail)
> marks
  Student_Name DSC2608_Marks Results
1         Tlou            80    pass
2         Jane            75    pass
3        Piers            44    fail
4        Sarah            39    fail
5        Thato            92    pass
1.5.4 Filtering/Subsetting. Use a logical operator, such as ==, >, <, <=, >= or !=, to filter or find a
subset. Note that the equality operator is written with two equals signs (==); a single = is reserved for
assignment. The result of a comparison is a logical value (TRUE or FALSE).
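Before turning to data frames, it may help to see what a comparison returns for a plain vector. The following sketch uses a hypothetical vector named marks_vec; the values are for illustration only.

```r
# Hypothetical marks used only for illustration
marks_vec <- c(80, 75, 44, 39, 92)

passed <- marks_vec >= 50      # one TRUE or FALSE per element
passed
class(passed)                  # "logical"

marks_vec[passed]              # keep only the elements where the condition is TRUE
sum(passed)                    # TRUE counts as 1, so this is the number of passes

# Conditions can be combined with & (and) and | (or)
marks_vec[marks_vec >= 50 & marks_vec < 90]
```

Filtering a data frame works on the same principle: the condition is evaluated for every row, and only the rows where it is TRUE are kept.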
You can use the filter() function from the dplyr package to filter the data in the data frame based on
certain conditions. For instance, the following R code filters the marks data frame to create a new
data frame named passed that only includes the records for students who passed the DSC2608 module,
based on the Results column. The filtered data frame is printed to the console using the print()
function:
> # filter the data to only include students who passed DSC2608 module
> passed <- marks %>% filter(Results == "pass")
> print(passed)
  Student_Name DSC2608_Marks Results
1         Tlou            80    pass
2         Jane            75    pass
3        Thato            92    pass
Similarly, the following R code creates a new data frame named failed. The filter condition is based
on the Results column of the marks data frame, where only the records that have a value of fail in
the Results column are selected:
> # filter the data to only include students who failed DSC2608 module
> failed <- marks %>% filter(Results == "fail")
> print(failed)
  Student_Name DSC2608_Marks Results
1        Piers            44    fail
2        Sarah            39    fail
Next, create a vector called exam_marks that contains the values 87, 62, 55, 41 and 96. Then
use the pipe operator %>% with the mutate() function to add the new column Exam_marks; the
updated data frame is assigned to a new variable called updated_marks:
> # create a vector of exam marks
> exam_marks <- c(87, 62, 55, 41, 96)
> # add the new column to the marks data frame
> updated_marks <- marks %>% mutate(Exam_marks = exam_marks)
> # print the updated marks data frame
> updated_marks
  Student_Name DSC2608_Marks Results Exam_marks
1         Tlou            80    pass         87
2         Jane            75    pass         62
3        Piers            44    fail         55
4        Sarah            39    fail         41
5        Thato            92    pass         96
Finally, the following R code creates a new data frame called pass_distinction, which contains only the
rows from the updated_marks data frame where Exam_marks is greater than or equal to 75 and
the Results column is equal to pass. This is done using logical operators and indexing.
> # students who scored above or equal to 75 on the exam and passed
> pass_distinction <- updated_marks[updated_marks$Exam_marks >= 75 &
+ updated_marks$Results == 'pass', ]
> pass_distinction
  Student_Name DSC2608_Marks Results Exam_marks
1         Tlou            80    pass         87
5        Thato            92    pass         96
Section 1.6. Importing packages and datasets, viewing data Page 20
Complete Activity 1.3 in the Exercise Manual before you proceed to the next section.
R packages are collections of R functions and datasets. Some standard packages are included with the
R installation, while others can be installed in RStudio by using the install.packages("package_name")
function. Some packages must be downloaded from http://cran.r-project.org/ or Google and
manually installed. Once installed, the package needs to be loaded in each session by using the following
R syntax:
library("package_name")
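A common pattern is to install a package only when it is missing and then load it. The sketch below uses the stats package as a stand-in because it ships with R, so the example runs without an internet connection; the object name pkg is an assumption, and you would replace "stats" with the package you need (for example "dplyr").

```r
# "stats" ships with R, so this sketch runs offline;
# replace it with the name of the package you need
pkg <- "stats"

# install only if the package is not already available
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# character.only = TRUE tells library() that pkg holds the name as a string
library(pkg, character.only = TRUE)
```

Installation is needed only once per computer, whereas library() must be called in every new R session.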
Often data are available in different formats ready to be imported into R. R accepts files with different
formats, for example, .txt, .csv and .xls. For instance, the following can be used to read several
frequently used files in R:
• Text files: use read.table() for space-separated or comma-separated files (set the sep argument accordingly)
• CSV files: use read_csv() from the readr package (used by the RStudio interface)
• Excel files: use read_excel() from the readxl package (used by the RStudio interface)
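The following sketch shows the base R route for a comma-separated file. To keep the example self-contained, it first writes a small hypothetical dataset to a temporary file; the file contents and the object names csv_path, marks_tbl and marks_csv are assumptions.

```r
# Hypothetical data written to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("student_name,DSC2608_marks",
             "Tlou,80",
             "Jane,75"),
           csv_path)

# read.table() needs to be told about the header row and the separator
marks_tbl <- read.table(csv_path, header = TRUE, sep = ",")

# read.csv() is a wrapper around read.table() with header = TRUE and sep = ","
marks_csv <- read.csv(csv_path)

marks_tbl
```

In practice you would replace csv_path with the path to your own file. The readr and readxl functions listed above are used in a similar way once those packages are installed and loaded.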
There are several ways to view a dataset or get an overview of the data. Firstly, you can
view the data in full by double-clicking on the dataset in the global environment or by using the
View(data_name) function.
A specific column can be extracted by using the data_name$column_name command. The first k rows or
the last k rows can be extracted by using the head(data_name, k) or tail(data_name, k) command,
respectively.
A quick overview of the dataset can be obtained with the summary(data_name) or str(data_name)
command.
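These viewing commands can be tried on mtcars, a small dataset that ships with R and is used here only as a stand-in for your own data. The object names first_rows, last_rows and mpg_column are assumptions made for this sketch.

```r
# mtcars is a built-in dataset (used here only as a stand-in)
first_rows <- head(mtcars, 3)    # first 3 rows
last_rows  <- tail(mtcars, 3)    # last 3 rows
mpg_column <- mtcars$mpg         # a single column as a vector

first_rows
str(mtcars)          # structure: number of rows and columns, and each column's type
summary(mtcars$mpg)  # minimum, quartiles, mean and maximum of one column
```

str() is a good first check after importing a file because it shows whether each column was read with the data type you expected.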
Complete Activity 1.4 in the Exercise Manual before you proceed with the formal assessment for
Learning Unit 1.
Now that you have reached the end of Learning Unit 1, complete Assessment 1 as outlined in the
Activities section on the module site and ensure that you submit the completed assessment for formal
evaluation.