BA - Unit 4
4
Data Structures in R
Ms. Asha Yadav
Assistant Professor
Department of Computer Science
School of Open Learning
University of Delhi
STRUCTURE
4.1 Learning Objectives
4.2 Introduction
4.3 Vectors
4.4 Matrices
4.5 Lists
4.6 Factors
4.7 Data Frames
4.8 Conditionals and Control Flows
4.9 Loops
4.10 Apply Family
4.11 Summary
4.12 Answers to In-Text Questions
4.13 Self-Assessment Questions
4.14 References
4.15 Suggested Readings
Department of Distance & Continuing Education, Campus of Open Learning,
School of Open Learning, University of Delhi
Identify when and how to select among various data structures and
control mechanisms.
Write cleaner, more efficient R code using the strength of functional
programming.
4.2 Introduction
In this lesson we will discuss data structures and how they help organize
and process data more efficiently. You may think of a data structure
as a blueprint that indicates how to arrange and store data. The design
of each structure is deliberate, as it allows data to be accessed and
manipulated in specific, structured ways. In programming and statistical
software like R, we use specialized functions to interact with these
structures. These tools make it easier to work with data of all shapes
and forms. R offers six key data structures to work with: vectors,
matrices, arrays, lists, factors and data frames.
These can be divided into two categories: homogeneous and heterogeneous
structures. The first three (vectors, matrices, and arrays) are like
neat, organized boxes in which every element is of the same type; hence
they are called homogeneous. Data frames and lists, on the other hand,
are heterogeneous structures that allow greater flexibility: they can
hold elements of different types together. A factor is a special
data structure used for handling categorical data (nominal or
ordinal). In the subsequent sections we will discuss these data structures.
A point to remember for those who are already familiar with programming:
R has no scalar types. Numbers, strings and other single values are
simply vectors of length one.
4.3 Vectors
A vector is one of the basic data structures in R. It is used to store
multiple values of the same type (also called the mode). It is
one-dimensional and can hold numeric, character, logical or other values,
but all the values must have the same mode. Vectors are fundamental to R,
and most operations are performed on them. Various types of
vectors are shown in Table 4.1 below:
Creating a Vector
You can create vectors using the c() function, which stands for combine
or concatenate. Vectors are stored contiguously in memory, just
like arrays in C, so the size of a vector is fixed at the time of
creation. Any modification that changes its length therefore leads to
reassignment (internally, a new vector is created under the same name).
Code to create and display a few vectors is shown below in code window 1.
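A minimal sketch of creating and displaying vectors with c(). The variable names and values are illustrative assumptions.

```r
# Illustrative vectors; names and values are assumed for this sketch
num_vec <- c(3, 23, 4)            # numeric (double) vector
chr_vec <- c("a", "b", "c")       # character vector
lgl_vec <- c(TRUE, FALSE, TRUE)   # logical vector
int_vec <- c(1L, 2L, 3L)          # integer vector (note the L suffix)

print(num_vec)
print(class(chr_vec))

# Mixing types forces every element into one common mode
mixed <- c(1, "two", TRUE)
print(class(mixed))               # all elements become character
```

The last two lines show why vectors are called homogeneous: R coerces mixed input to a single mode rather than storing different types side by side.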
Code Window 1
Another point to note is that the c() function also lets you modify or
reassign an existing vector, as shown in code window 2.
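As a hedged illustration, extending a vector might look like the sketch below. The starting values of v1 are an assumption, and v2 is a hypothetical second vector used to show the index-based alternative.

```r
v1 <- c(3, 23, 4)     # assumed starting values
v1 <- c(v1, 10)       # append 10 at the end by re-combining
print(v1)

v2 <- c(3, 23, 4)
v2[4] <- 10           # assign directly to the 4th position
print(v2)
```

Both forms produce a vector of length four; in each case R builds a new, longer vector and binds it to the same name.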
Code Window 2
This adds 10 to the vector v1, either at the end or at the 4th position
as instructed.
Vectors are useful for analysis because R supports a wide range of
operations on them. In this section we will explore the operations that
can be used on vectors.
Length: We can obtain the length of a vector using the length() function.
This can be used to iterate over a vector in loops.
This will give 3 and (3,23,4) as output. You can also give a negative
index to omit a value; for example, print(v1[-2]) outputs all values
except the one at the second index.
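The length and indexing operations described above can be sketched as follows, assuming v1 holds the three values mentioned in the text.

```r
v1 <- c(3, 23, 4)

n <- length(v1)          # number of elements
print(n)
print(v1)

# length() is handy for driving a loop over the vector
for (i in seq_len(n)) {
  print(v1[i])
}

second   <- v1[2]        # positive index selects an element
no_second <- v1[-2]      # negative index drops that element
print(no_second)
```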
You can also filter a vector by applying a logical expression that
returns TRUE/FALSE for each element; the output consists of the elements
for which the expression is TRUE.
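A small sketch of logical filtering, again assuming the same illustrative v1:

```r
v1 <- c(3, 23, 4)

keep <- v1 > 3                 # logical vector, one value per element
print(keep)

big   <- v1[keep]              # keeps only elements where keep is TRUE
evens <- v1[v1 %% 2 == 0]      # condition written inline
pos   <- which(v1 > 3)         # positions, rather than values, that match
```

which() is included only to contrast positions with values; it is not required for filtering.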
Code Window 3
Miscellaneous Functions: There are certain functions shown below
in Table 4.4 which can be used with vectors, as required.
Table 4.4: Miscellaneous Functions
4.4 Matrices
Since you have understood vectors and various operations that can be
applied on them, now, let’s talk about matrices. You can understand a
matrix as an enhanced vector: it’s really nothing but a vector with two
extra attributes; namely the number of rows and the number of columns.
As with vectors, matrices are also homogeneous. However, don’t mix up
one-row or one-column matrices with vectors; they are not the same.
Matrices are actually a special case of a broader concept in R called
arrays. While matrices have just two dimensions (rows and columns),
arrays can have more. For instance, a three-dimensional array has rows,
columns, and layers, adding an extra level of organization to your data.
Matrices are useful in R because of the wide range of operations that
you can carry out on them. Many of these operations build on what you
already know about vectors, such as subsetting and vectorization, but
extend them to two dimensions. The added structure of rows and columns
makes matrices ideal for mathematical operations, data manipulation,
and statistical modelling.
The various operations on matrices are discussed below:
Creation: Matrices are generally created using the matrix() function;
the data in a matrix is stored in column-major order by default.
The ‘nrow’ parameter specifies the number of rows, and ‘ncol’ the number
of columns. We can use ‘byrow = TRUE’ to fill the matrix row-wise instead
of column-wise. Code to create a matrix using the matrix() function and
by using vectors is shown below in code window 4.
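A minimal sketch of matrix creation; the object names and the data 1:6 are assumptions chosen to make the fill order easy to see.

```r
# Default fill is column by column
m_col <- matrix(1:6, nrow = 2, ncol = 3)
print(m_col)

# byrow = TRUE fills row by row instead
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
print(m_row)

# Building a matrix from vectors: rbind() stacks rows, cbind() joins columns
m_rbind <- rbind(c(1, 2, 3), c(4, 5, 6))
m_cbind <- cbind(c(1, 2), c(3, 4))
```

Comparing m_col and m_row element by element is a quick way to check which fill order a given call used.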
Code Window 4
Code Window 5
Code Window 6
Code Window 7
As you may have noticed in code window 7, element-wise arithmetic
multiplication and matrix multiplication are two different operations
(the * and %*% operators, respectively). Other useful functions are
rowSums() and colSums(), which give the sums of rows/columns, and
rowMeans() and colMeans(), which give the means of rows/columns.
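The difference between the two kinds of multiplication can be sketched with two small matrices; A and B are illustrative names and values.

```r
A <- matrix(1:4, nrow = 2)   # rows: (1, 3) and (2, 4)
B <- matrix(5:8, nrow = 2)   # rows: (5, 7) and (6, 8)

elementwise <- A * B         # multiplies matching positions
mat_product <- A %*% B       # rows of A times columns of B

print(elementwise)
print(mat_product)

row_totals <- rowSums(A)
col_totals <- colSums(A)
row_avgs   <- rowMeans(A)
col_avgs   <- colMeans(A)
```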
Just as with vectors, indexing and subsetting can be done on matrices.
You can access specific elements, rows, or columns using indices
as shown in code window 8.
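A hedged sketch of matrix indexing, using an assumed 3 x 3 matrix m:

```r
m <- matrix(1:9, nrow = 3)        # columns are 1:3, 4:6, 7:9

one_cell  <- m[2, 3]              # row 2, column 3
first_row <- m[1, ]               # leaving a position blank selects all
second_col <- m[, 2]
block     <- m[1:2, c(1, 3)]      # rows 1-2 of columns 1 and 3

# drop = FALSE keeps the result as a one-column matrix
col_as_matrix <- m[, 2, drop = FALSE]
```

Without drop = FALSE, selecting a single row or column returns a plain vector, which matters when later code expects two dimensions.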
Code Window 8
based on logical criteria. Some examples are shown below in code
window 9.
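Selection by logical criteria on a matrix might look like the following sketch; the matrix m and the thresholds are assumptions.

```r
m <- matrix(1:9, nrow = 3)

# A logical matrix used as an index returns the matching values as a vector
above_five <- m[m > 5]

# A logical vector in the row position keeps whole rows
rows_kept <- m[m[, 1] > 1, ]      # rows whose first column exceeds 1
print(rows_kept)
```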
Code Window 9
You can also give names to the rows and columns of a matrix using
the dimnames() function or by specifying them during the creation
of the matrix (as shown in code window 10).
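Both ways of naming can be sketched as below; the row and column labels are made up for illustration.

```r
# Names supplied at creation time
m1 <- matrix(1:6, nrow = 2,
             dimnames = list(c("r1", "r2"), c("a", "b", "c")))

# Names attached afterwards with dimnames()
m2 <- matrix(1:6, nrow = 2)
dimnames(m2) <- list(c("r1", "r2"), c("a", "b", "c"))

picked <- m1["r2", "b"]           # names can be used in place of indices
print(rownames(m2))
print(colnames(m2))
```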
Code Window 10
Arrays
An array in R is a data structure that can store data in more than
one dimension; in this sense, arrays are an extension of matrices. While
a matrix is constrained to two dimensions (rows and columns), an
array can have three or more. Arrays are useful for organizing and
manipulating data with more than two axes, such as 3D spatial data or
multi-dimensional experimental results.
An array can be created using the array() function with arguments for
the data, dimensions and dimension names, as shown in code window 11.
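A minimal sketch of a three-dimensional array; the sizes and labels are assumptions.

```r
arr <- array(1:24,
             dim = c(2, 3, 4),    # 2 rows, 3 columns, 4 layers
             dimnames = list(c("r1", "r2"),
                             c("c1", "c2", "c3"),
                             paste0("layer", 1:4)))

cell        <- arr[1, 2, 3]       # row 1, column 2, layer 3
first_layer <- arr[, , 1]         # one layer is an ordinary 2 x 3 matrix
print(dim(arr))
print(first_layer)
```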
Code Window 11
Code Window 12
Code Window 13
4.5 Lists
In R, a list is an extremely flexible data structure: it can store
any kind of data together, such as numbers, characters, vectors,
matrices, and even other lists. This flexibility distinguishes lists
from vectors and matrices, which require all elements to be of the same
type. A list is useful for organizing complex data where different types
coexist. In R, lists are used frequently, not only for storing results
from statistical models but also for organizing heterogeneous data in
general.
You create a list by using the “list()” function, and individual
elements of the list are accessed using double square brackets “[[ ]]”.
For instance, “list(42, “Hello”, c(1, 2, 3))” generates a list
that holds a number, a string, and a vector.
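The list from the text can be sketched as follows; the second, named list is a hypothetical addition to show access by name.

```r
my_list <- list(42, "Hello", c(1, 2, 3))

second  <- my_list[[2]]           # [[ ]] returns the element itself
inner   <- my_list[[3]][2]        # then index inside that element
sublist <- my_list[1]             # single [ ] returns a list of length 1

# Elements can also be named and accessed with $
named <- list(id = 42, greeting = "Hello")
print(named$greeting)
```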
Code Window 14
Code Window 15
We can find the size of a list using length(); we can also add or delete
elements.
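A small sketch of measuring and changing a list; the element names are illustrative.

```r
lst <- list(a = 1, b = "x")
size_before <- length(lst)

lst$c <- c(TRUE, FALSE)           # add an element by name
lst[["d"]] <- 3.5                 # same idea with [[ ]]
size_after <- length(lst)

lst$b <- NULL                     # assigning NULL deletes the element
print(names(lst))
```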
Code Window 16
4.6 Factors
Factors are another type of R object, created from a vector. A factor
stores the vector along with a record of the distinct values in that
vector, called levels. Factors are mainly used for categorical (nominal
or ordinal) data.
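A minimal sketch of building a factor from a vector. The eight values below are assumed for illustration and chosen so that only three distinct levels occur.

```r
x   <- c(5, 10, 5, 20, 10, 5, 20, 10)   # assumed data: 8 values
fac <- factor(x)

print(fac)
print(levels(fac))     # the distinct values, stored as character labels
print(nlevels(fac))
print(table(fac))      # how often each level occurs
```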
Code Window 17
As shown in code window 17, the factor fac has 8 values but only 3
distinct levels. Levels are very useful, as shown in code window 18:
Code Window 18
In code window 18, case 1 shows that assigning a value to index 2 of the
factor succeeds because the value belongs to a predefined level. In
case 2, NA is assigned to index 2 of the factor instead of 15, because
15 is not among the factor levels. In case 3 we anticipated a new level
that was not present in the initial vector by supplying it in the factor
definition. Thus, illegal values cannot be assigned to factors.
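The three cases can be sketched as below. The data and the names fac1, fac2 and fac3 are assumptions; only the behaviour (success, NA, success with a pre-declared level) follows the description above.

```r
x <- c(5, 10, 5, 20, 10, 5, 20, 10)

# Case 1: 20 is an existing level, so the assignment succeeds
fac1 <- factor(x)
fac1[2] <- 20

# Case 2: 15 is not a level, so R stores NA and issues a warning
fac2 <- factor(x)
fac2[2] <- 15

# Case 3: declare 15 as a level up front, then the assignment is legal
fac3 <- factor(x, levels = c(5, 10, 15, 20))
fac3[2] <- 15

print(fac1[2]); print(fac2[2]); print(fac3[2])
```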
Two commonly used functions with factors are split() and by().
As the name suggests, the split() function divides an object
(such as a vector, data frame, or list) into subsets based on a
grouping factor. It is particularly useful when you want to break
down your data into smaller groups according to a categorical variable.
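A hedged sketch of split(); the values and the group labels A, B and C are illustrative.

```r
data  <- c(10, 20, 30, 40, 50, 60)
group <- factor(c("A", "B", "C", "A", "B", "C"))

parts <- split(data, group)       # a named list, one element per level
print(parts)

group_a <- parts$A                # values that belong to group A
```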
Code Window 19
As shown in the code above, the vector data is split into groups A, B
and C according to the corresponding factor. The by() function, in
contrast, applies a function to subsets of a data object that have been
grouped by a factor. It is used in scenarios where you want to perform
operations such as calculating the mean, sum, or other statistical
measures for each group, as shown in code window 20.
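A minimal sketch of by() computing a per-group mean. The data frame df and its column names are assumptions; a data frame is used here because by() passes each group's rows to the function.

```r
df <- data.frame(
  value = c(10, 20, 30, 40, 50, 60),
  group = factor(c("A", "B", "C", "A", "B", "C"))
)

# For each level of group, compute the mean of that group's values
group_means <- by(df, df$group, function(d) mean(d$value))
print(group_means)
```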
Code Window 20
are of the same length, and meaningful column names can be assigned
for easy interpretation and management of data.
Data frame creation is shown in code window 21.
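A hedged sketch of creating a data frame; the column names and values are invented for illustration.

```r
students <- data.frame(
  name   = c("Anu", "Ravi", "Meena"),   # character column
  age    = c(21, 23, 22),               # numeric column
  passed = c(TRUE, FALSE, TRUE),        # logical column
  stringsAsFactors = FALSE
)

print(students)
str(students)                           # compact view of column types
print(dim(students))                    # rows, then columns
```

Each column is a vector of a single type, but different columns may have different types, which is what makes the structure suitable for tabular data.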
Code Window 21
Code Window 22
Subsets can be extracted from data frames based on row and column
selection, using logical conditions, or by using the subset() function
as shown in code window 23.
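The three styles of subsetting can be sketched on an assumed students data frame:

```r
students <- data.frame(
  name   = c("Anu", "Ravi", "Meena"),
  age    = c(21, 23, 22),
  passed = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

second_row <- students[2, ]                    # by row position
ages       <- students[, "age"]                # by column name
older      <- students[students$age > 21, ]    # by logical condition

# subset() combines a row condition with a column selection
older_passed <- subset(students, age > 21 & passed, select = c(name, age))
print(older_passed)
```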
Code Window 23
Code Window 24
We can use rbind() or cbind() to combine two data frames row-wise or
column-wise, provided they have the same columns in the case of rbind()
and the same number of rows in the case of cbind(). We can also use the
merge() function to combine two data frames by matching rows based on
common columns.
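A minimal sketch of the three ways of combining data frames; all names and values are assumptions.

```r
df1 <- data.frame(id = 1:2, name = c("A", "B"))
df2 <- data.frame(id = 3:4, name = c("C", "D"))

stacked <- rbind(df1, df2)                 # same columns: rows are appended

scores <- data.frame(score = c(90, 85))
wide   <- cbind(df1, scores)               # same row count: columns are added

marks  <- data.frame(id = c(2L, 3L, 4L), marks = c(70, 80, 90))
merged <- merge(stacked, marks, by = "id") # keeps ids present in both
print(merged)
```

By default merge() keeps only rows whose key appears in both inputs, so id 1 is absent from merged.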
Code Window 25
Figure 4.1
There are three decision-making constructs in R programming: if,
if…else, and switch.
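The three constructs can be sketched as follows; the variables marks and op and their thresholds are hypothetical.

```r
marks <- 72

# if: run a block only when the condition is TRUE
status <- "Fail"
if (marks >= 40) {
  status <- "Pass"
}

# if...else: choose between two blocks
if (marks >= 60) {
  division <- "First"
} else {
  division <- "Second"
}

# switch: pick a branch by matching a string; the unnamed last value is the default
op <- "add"
result <- switch(op, add = 2 + 3, sub = 2 - 3, NA)

print(status); print(division); print(result)
```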
Code Window 26
Code Window 27
Code Window 28
Code Window 29
4.9 Loops
Like any other programming language, we have loops in R too. They
are basic constructs allowing a block of code to be executed repeatedly.
R implements several kinds of loops: for, while, and repeat. Each loop
type is suited for different tasks, depending on the kind of control flow
needed. We will discuss the code and syntax of each of these loops in
this section.
For Loop: It is used to iterate over the elements of an iterable
object, such as a vector, list, or sequence, using a loop control
variable. The code for a for loop is given in code window 30.
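A minimal sketch of a for loop over a vector; the vector v and the running total are illustrative.

```r
v <- c(3, 23, 4)

# Iterate over the values directly
for (value in v) {
  print(value)
}

# Iterate over positions when the index itself is needed
total <- 0
for (i in seq_along(v)) {
  total <- total + v[i]
}
print(total)
```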
Code Window 30
The above code iterates over a vector and prints its elements
one by one. We can write code to iterate over other data structures in
the same manner.
While Loop: Like the for loop, the while loop repeatedly executes a block
of code, in this case as long as its condition remains TRUE. Here,
however, the loop control variable needs to be initialized outside the
loop. Code that uses a while loop to print the sum of 5 numbers is shown
below in code window 31; the iteration variable is incremented inside
the loop.
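A hedged sketch of a while loop that sums the numbers 1 to 5; the choice of these five numbers is an assumption.

```r
i     <- 1        # loop control variable, initialized outside the loop
total <- 0

while (i <= 5) {
  total <- total + i
  i <- i + 1      # without this increment the loop would never end
}

print(total)
```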
Code Window 31
Code Window 32
We can also have nested loops for complex operations where iteration
is needed at several levels. For example, code that processes each
column for every row using nested loops is shown in code window 33.
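A minimal sketch of nested loops that computes and prints the product of i and j for every pair. The ranges 1:3 and 1:2 and the products matrix are assumptions added so the results can be inspected afterwards.

```r
products <- matrix(0, nrow = 3, ncol = 2)

for (i in 1:3) {            # outer loop: one pass per row
  for (j in 1:2) {          # inner loop: runs fully for each i
    products[i, j] <- i * j
    cat(i, "x", j, "=", i * j, "\n")
  }
}

print(products)
```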
Code Window 33
Here, the outer loop takes each value of i, and for every value of
i, the inner loop takes each value of j. This structure ensures that
one calculation, the product of i and j, is performed for each pair of
values taken by i and j. The result of this calculation is then printed.
This pattern is typically used for tasks such as building tables,
making pairwise comparisons, or any combinatorial operation involving
several variables.
The next and break statements can be used to control a loop: next
skips the current iteration and moves to the next one, while break
terminates the loop entirely, as seen in the repeat loop. Code is given
in code window 34.
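Both statements can be sketched as below; the values skipped and the stopping point are arbitrary choices for illustration.

```r
# next: skip one iteration and carry on
kept <- c()
for (i in 1:5) {
  if (i == 3) next          # 3 is skipped
  kept <- c(kept, i)
}
print(kept)

# break: the only way out of a repeat loop
count <- 0
repeat {
  count <- count + 1
  if (count >= 3) break     # stop once the counter reaches 3
}
print(count)
```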
Code Window 34
Thus, loops in R are very helpful for automating repetitive tasks. The
for loop iterates over the elements of a sequence, such as a vector or
list, executing a block of code for each element. The “while” loop
continues to execute as long as a specified condition is “TRUE”, which
makes it a good choice for tasks where the number of iterations is not
known beforehand. A repeat loop runs endlessly until it is stopped by a
break statement, which must be provided; it is ideal when the stopping
condition is too complex to express as a single loop condition.
Although loops are very general, R’s vectorized operations and
apply-family functions are often faster alternatives for handling large
datasets or for simple operations, and are generally preferred in such
cases.
IN-TEXT QUESTION
5. Write an R code snippet using an if-else statement to check if
a number is even or odd.
4.10 Apply Family
The apply family in R includes functions like apply, lapply, sapply,
vapply, tapply, mapply, and rapply. It is a very useful and powerful
feature of R. These functions provide alternatives to loops for applying
functions across various data structures like vectors, matrices, arrays,
lists, factors, and data frames. They are generally more concise than
explicit loops and can improve code readability; explicit loops can also
be slower than vectorized operations. In this section we will discuss
these functions one by one along with code.
The apply() function operates on the margins of a matrix or array.
It applies a given function along the rows or columns of a matrix or
higher-dimensional array. The syntax is apply(X, MARGIN, FUN),
where X is a matrix or array, MARGIN specifies the dimension to operate
over (1 for rows, 2 for columns), and FUN is the function to apply.
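A minimal sketch of apply() on an assumed 2 x 3 matrix:

```r
m <- matrix(1:6, nrow = 2)        # rows: (1, 3, 5) and (2, 4, 6)

row_sums <- apply(m, 1, sum)      # MARGIN = 1: once per row
col_max  <- apply(m, 2, max)      # MARGIN = 2: once per column

# MARGIN = c(1, 2) applies the function to every cell
squared  <- apply(m, c(1, 2), function(x) x^2)

print(row_sums); print(col_max); print(squared)
```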
Code Window 35
Code Window 36
Code Window 37
vapply() is similar to lapply() and sapply(), but it lets you specify
the expected output type for better reliability.
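The relationship between the three functions can be sketched on a small list; the list nums is an illustrative assumption.

```r
nums <- list(a = 1:3, b = 4:6)

as_list   <- lapply(nums, mean)   # always returns a list
as_vector <- sapply(nums, mean)   # simplifies to a vector when it can

# vapply requires a template for each result: here, one numeric value
checked   <- vapply(nums, mean, FUN.VALUE = numeric(1))

print(as_list); print(as_vector); print(checked)
```

If a call to the function returned something other than a single number, vapply() would stop with an error instead of silently changing the result's shape.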
Code Window 38
Code Window 39
Code Window 40
Code Window 41
In this code we pass a nested list to rapply(), and x^2 is applied to
each element of the list. The classes argument restricts the function
to elements of specific classes, and the how argument controls the
structure of the result, such as “unlist” for a vector or “replace” for
a nested list.
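A hedged sketch of rapply() with both return styles. The nested list and its names are assumptions; the character element is included to show the effect of the classes argument.

```r
nested <- list(a = c(1, 2, 3), b = list(c = 4, d = "text"))

# how = "unlist": numeric elements are squared and flattened into one vector
squared_vec <- rapply(nested, function(x) x^2,
                      classes = "numeric", how = "unlist")
print(squared_vec)

# how = "replace": same nesting is kept, non-numeric elements are untouched
squared_list <- rapply(nested, function(x) x^2,
                       classes = "numeric", how = "replace")
str(squared_list)
```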
Table 4.5 shows various functions of apply family.
4.11 Summary
In this chapter we have covered some of the basic building blocks in R
that serve as the foundation for manipulating data and controlling
programs. Vectors are one-dimensional structures that hold elements of
the same type, whereas matrices extend this concept to two dimensions,
and arrays generalize further to n dimensions. Lists, however, are
containers that can hold elements of different types, making them very
versatile. Factors are used to represent categorical data in a
statistically efficient manner. Data frames, a hybrid structure that
combines the features of lists and matrices, are ideal for organizing
tabular data. To add logic to your programs, constructs like “if”,
“else”, and “switch” provide decision-making capabilities. For
repetitive operations, loops like “for”, “while”, and “repeat” are
necessary; however, the apply family of functions often provides more
efficient alternatives, enabling concise, functional programming.
This chapter has laid a solid foundation for dealing with data, writing
efficient code, and solving complex programming problems in R.
2. Explain how a list differs from a vector and give a practical example
of when you would use a list instead of a vector.
3. Describe the structure of a data frame and explain why it is particularly
useful for working with tabular data.
4. Write an R code snippet using an if-else statement to determine
whether a number is positive, negative, or zero.
5. What is the purpose of the apply family of functions, and how do
they improve code efficiency compared to traditional loops?
4.14 References
Wickham, H., & Grolemund, G. (2017). R for Data Science. O’Reilly
Media.
Matloff, N. (2011). The Art of R Programming. No Starch Press.
Crawley, M. J. (2012). The R Book. Wiley.