MDPN460 Lecture03

Uploaded by

mohamedggharib02

MDPN460 – Industrial Engineering Lab
Lecture 3

Basic Statistical Analysis Using R Programming


Today’s Lecture

Basic statistical analysis using R
– More on vectors
– Matrices and arrays
– Data storage in R
– Packages, libraries and repositories
– Some built-in graphics functions

Introduction to the paper airplane factory software

A free course on Machine Learning with R:
https://www.youtube.com/watch?v=Liws4MShq1A
Numeric vectors in R

As shown in the last lecture, a numeric vector is an ordered
collection of numbers.

The c() function is used to collect things together into a
vector.

The c() function can be used to join vectors as in the
following example:
> v1 <- c(10:5)
> v2 <- c(22, 25, 65)
> v3 <- c(v1, v2)
> v3
[1] 10 9 8 7 6 5 22 25 65
> v4 <- c(v3, 2:6)
> v4
[1] 10 9 8 7 6 5 22 25 65 2 3 4 5 6
Extracting elements from vectors

To access an element of a vector, write its index inside
square brackets after the vector's name.

Vectors in R are not zero-based, i.e., they start at 1.
> V <- 20:100
> V[50]
[1] 69
> V[1]
[1] 20
> V[80]
[1] 99
> V[81]
[1] 100
> V[82]
[1] NA
Extracting sub-vectors from vectors

Sub-vectors can be accessed using a colon between two
indexes.
> V[10:20]
[1] 29 30 31 32 33 34 35 36 37 38 39

Or you can specify indexes using c()
> V[c(2,5,50)]
[1] 21 24 69

You can exclude indexes
> V[-(20:70)]
[1] 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 90 91 92
[23] 93 94 95 96 97 98 99 100
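
A minimal sketch of the indexing rules above, reusing the vector V from the slides. The calls to length() and the name kept are additions for illustration and are not part of the original lecture.

```r
V <- 20:100             # same vector as in the slides
length(V)               # number of elements: 81
V[length(V)]            # last element: 100
kept <- V[-(20:70)]     # drop positions 20 to 70 (51 elements)
length(kept)            # 81 - 51 = 30 elements remain
V[82]                   # index past the end of the vector gives NA
```

Note that a negative index removes positions, not values: V[-(20:70)] drops the 20th to 70th elements, which hold the values 39 to 89.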

Vector Arithmetic

Arithmetic operations on a numeric vector are applied
element by element.
> x <- c(0, 1, 3, 7, 12, 20, 99)
> x * 2
[1] 0 2 6 14 24 40 198
> x / 2
[1] 0.0 0.5 1.5 3.5 6.0 10.0 49.5
> x^2
[1] 0 1 9 49 144 400 9801
> (x - 5) / 2
[1] -2.5 -2.0 -1.0 1.0 3.5 7.5 47.0
> (x * 3) %% 2
[1] 0 1 1 1 0 0 1
> (x * 3) %/% 2
[1] 0 1 4 10 18 30 148
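
The last two operators are the remainder (%%) and integer division (%/%). As a small sketch, assuming the same x as above, the two results can be recombined to recover the original values; the names y, q and r are introduced here only for illustration.

```r
x <- c(0, 1, 3, 7, 12, 20, 99)
y <- x * 3
q <- y %/% 2            # integer quotient of each element
r <- y %% 2             # remainder of each element
all(q * 2 + r == y)     # TRUE: quotient * divisor + remainder gives y back
```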
Simple Patterned Vectors

We have seen the use of the : operator for producing
simple sequences of integers. Patterned vectors can also
be produced using the seq() function as well as the rep()
function.
> seq(1, 21, by = 2)
[1] 1 3 5 7 9 11 13 15 17 19 21
> rep(3, 12) # repeat the value 3, 12 times
[1] 3 3 3 3 3 3 3 3 3 3 3 3
> rep(seq(2, 20, by = 2), 2) # repeat the pattern 2 4 ... 20, twice
[1] 2 4 6 8 10 12 14 16 18 20 2 4 6 8 10 12 14 16 18 20
> rep(c(1, 4), c(3, 2)) # repeat 1, 3 times and 4 twice
[1] 1 1 1 4 4
> rep(c(1, 4), each = 3) # repeat each value 3 times
[1] 1 1 1 4 4 4
> rep(1:10, rep(2, 10)) # repeat each value twice
[1] 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10
Vectors with random patterns

The sample() function allows us to simulate
things like the results of the repeated tossing of
a 6-sided die.
> sample(1:6, size = 8, replace = TRUE) # an imaginary die is tossed 8 times
[1] 3 4 4 2 1 6 6 5
> sample(1:6, size = 8, replace = TRUE) # an imaginary die is tossed 8 times
[1] 4 2 5 2 6 3 2 4
> sample(1:6, size = 8, replace = TRUE) # an imaginary die is tossed 8 times
[1] 2 6 4 2 2 6 2 6
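
Each call above gives a different result. When a reproducible result is needed (for example, in a lab report), the random number generator can be seeded first with set.seed(). The sketch below is an addition to the slides; the seed value 460 is arbitrary.

```r
set.seed(460)                                   # arbitrary seed, chosen for illustration
first <- sample(1:6, size = 8, replace = TRUE)  # eight die tosses
set.seed(460)                                   # reset the generator to the same state
second <- sample(1:6, size = 8, replace = TRUE) # the same eight tosses again
identical(first, second)                        # TRUE
```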

Character vectors

Scalars and vectors can be made up of strings of
characters instead of numbers.

All elements of a vector must be of the same type.

> Student <- c("Ahmed", "Yasser", "Mona", "Amr")


> sample(Student, size = 3, replace = TRUE)
[1] "Mona" "Yasser" "Mona"
> sample(Student, size = 3, replace = FALSE)
[1] "Amr" "Mona" "Ahmed"
> More.Student <- c(Student, 322)
> sample(More.Student, size = 3, replace = FALSE)
[1] "Ahmed" "322" "Yasser"
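
Note that 322 appears in quotes in the output above: because all elements of a vector must share one type, c() converted the number to a character string. A short sketch of this coercion follows; class() and as.numeric() are standard R functions, and the vector mixed is a made-up example.

```r
Student <- c("Ahmed", "Yasser", "Mona", "Amr")
More.Student <- c(Student, 322)
class(More.Student)                   # "character"
More.Student[5]                       # "322", stored as a string
as.numeric(More.Student[5]) + 1       # 323, after converting back to a number
mixed <- c(1, TRUE, "a")              # number and logical both become strings
mixed                                 # "1" "TRUE" "a"
```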

Basic operations on character vectors

There are two basic operations you might want to
perform on character vectors.

To take substrings, use substr() . It takes arguments
substr(x, start, stop) , where x is a vector of character
strings, and start and stop say which characters to keep.
> Initial <- substr(More.Student, 1, 1)
> Initial
[1] "A" "Y" "M" "A" "3"
> Initial <- substr(More.Student, 1, 2)
> Initial
[1] "Ah" "Ya" "Mo" "Am" "32"

Basic operations on character vectors

The other basic operation is building up strings by
concatenation within elements. Use the paste() function
for this.
> paste("Name:", Student, sep=" ")
[1] "Name: Ahmed" "Name: Yasser" "Name: Mona" "Name: Amr"
> paste("Name:", Student, "Initial:", Initial, sep=" ")
[1] "Name: Ahmed Initial: Ah" "Name: Yasser Initial: Ya" "Name: Mona Initial: Mo"
[4] "Name: Amr Initial: Am" "Name: Ahmed Initial: 32"
> paste("Name:", More.Student, "Initial:", Initial, sep=" ")
[1] "Name: Ahmed Initial: Ah" "Name: Yasser Initial: Ya" "Name: Mona Initial: Mo"
[4] "Name: Amr Initial: Am" "Name: 322 Initial: 32"
> picker <- paste("Picked student:", Student, sep = " ")
> sample(picker, size=4, replace = FALSE)
[1] "Picked student: Mona" "Picked student: Ahmed" "Picked student: Yasser"
[4] "Picked student: Amr"
Factors

Factors offer an alternative way to store character
data. For example, a factor with four elements and
two levels, control and treatment, can be
created using:
> grp <- c("control", "treatment", "control", "treatment")
> grp
[1] "control" "treatment" "control" "treatment"
> grp <- factor(grp)
> grp
[1] control treatment control treatment
Levels: control treatment
> levels(grp)
[1] "control" "treatment"
> as.integer(grp)
[1] 1 2 1 2
Factors

The levels() function can be used to change factor
labels as well. For example, suppose we wish to
change the "control" label to "placebo" . Since
"control" is the first level, we change the first
element of the levels(grp) vector:
> levels(grp)
[1] "control" "treatment"
> as.integer(grp)
[1] 1 2 1 2
> levels(grp)[1] <- "placebo"
> grp
[1] placebo treatment placebo treatment
Levels: placebo treatment
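
By default, factor() orders the levels alphabetically, which is why "control" received code 1. As a sketch, the levels argument of factor() can set a different order, and table() counts the elements in each level. The names raw and grp2 are introduced here for illustration only.

```r
raw <- c("control", "treatment", "control", "treatment")
grp <- factor(raw)                    # levels sorted alphabetically
nlevels(grp)                          # 2
table(grp)                            # two elements in each level
grp2 <- factor(raw, levels = c("treatment", "control"))  # explicit level order
levels(grp2)                          # "treatment" "control"
as.integer(grp2)                      # 2 1 2 1
```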
Matrices and Arrays

To arrange values into a matrix, we use the matrix()
function:
> m1 <- matrix(1:6, nrow = 2, ncol = 3)
> m1
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

We can then access elements using two indices. For
example, the value in the first row, second column is
> m1[1, 2]
[1] 3


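Alternatively, a single index can be used. R stores matrices
column by column, so elements are numbered down the first
column, then the second, and so on:
> m1[6]
[1] 6
> m1[4]
[1] 4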
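
A small sketch of how a single index relates to a (row, column) pair, assuming the same m1 as above. The variables i, j, k and the matrix m2 are illustrative additions; byrow is a standard argument of matrix().

```r
m1 <- matrix(1:6, nrow = 2, ncol = 3)
i <- 1; j <- 2
k <- (j - 1) * nrow(m1) + i           # position of element (i, j) in column order: 3
m1[k] == m1[i, j]                     # TRUE
m2 <- matrix(1:6, nrow = 2, byrow = TRUE)  # fill row by row instead
m2[1, 2]                              # 2
```

Filling with byrow = TRUE changes only how the values are laid out when the matrix is created; single-index access still counts down the columns.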
Accessing whole rows or columns

Whole rows or columns of matrices may be
selected by leaving one index blank:
> m1
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> m1[2,]
[1] 2 4 6
> m1[,2]
[1] 3 4

More general arrays

A more general way to store data is in an array. Arrays have
multiple indices, and are created using the array() function:
> a1 <- array(sample(1:24), c(3, 4, 2))
> a1
, , 1

     [,1] [,2] [,3] [,4]
[1,]   21    7   13   14
[2,]   17   18   23   22
[3,]    5    8    4   11

, , 2

     [,1] [,2] [,3] [,4]
[1,]   12   15   16    1
[2,]    2   19   24    9
[3,]    3    6   20   10
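
Array elements are accessed with one index per dimension. The sketch below uses a non-random array, a2 (a name made up for this example), so that the results are predictable.

```r
a2 <- array(1:24, dim = c(3, 4, 2))   # 3 rows, 4 columns, 2 layers
dim(a2)                               # 3 4 2
a2[2, 3, 1]                           # row 2, column 3, layer 1: 8
a2[1, 1, 2]                           # first element of the second layer: 13
layer2 <- a2[, , 2]                   # blank indices select a whole 3 x 4 layer
dim(layer2)                           # 3 4
```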
Data storage in R

As in most programming languages, numerical values in R
are stored and processed internally in binary format.

Because many decimal fractions (such as 0.8) have no exact
binary representation, this leads to round-off errors, as
shown in the following example.
> n <- 1:10
> 1.25 * (n * 0.8) - n
[1] 0.000000e+00 0.000000e+00 4.440892e-16 0.000000e+00 0.000000e+00 8.881784e-16
[7] 8.881784e-16 0.000000e+00 0.000000e+00 0.000000e+00
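
Because of these round-off errors, testing computed values for exact equality with == can give surprising results. A common approach, sketched here as an addition to the slides, is to compare within a tolerance using all.equal(); the threshold 1e-12 is an arbitrary choice for this example.

```r
0.1 + 0.2 == 0.3                      # FALSE, because of round-off error
isTRUE(all.equal(0.1 + 0.2, 0.3))     # TRUE: equal within a small tolerance
n <- 1:10
err <- 1.25 * (n * 0.8) - n           # same expression as in the slide
all(abs(err) < 1e-12)                 # TRUE: the errors are tiny but not always zero
```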

Dates and Times

When looking at dates over historical time periods,
changes to the calendar (such as the switch from the
Julian calendar to the modern Gregorian calendar
that occurred in various countries between 1582 and
1923) affect the interpretation of dates.

Times are also messy, because there is often an
unstated time zone (which may change for some
dates due to daylight saving time), and some years
have “leap seconds” added in order to keep standard
clocks consistent with the rotation of the earth.

Dates and Times

In R, the base package provides the function strptime() to
convert from strings (e.g. "2020-12-25" , or
"12/25/20" ) to an internal numerical representation,
and format() to convert back for printing.

The ISOdate() and ISOdatetime() functions are used
when numerical values for the year, month, day, etc.
are known. Other functions are available in the
chron package.
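
A minimal sketch of these functions, using the two example strings mentioned above. The format codes (%Y, %m, %d, %y) are the standard ones accepted by strptime(); the time zone "UTC" and the variable names are assumptions made for this example.

```r
# Parse strings into date-time objects; the format describes the input layout
d1 <- strptime("2020-12-25", format = "%Y-%m-%d", tz = "UTC")
d2 <- strptime("12/25/20", format = "%m/%d/%y", tz = "UTC")
format(d1, "%d/%m/%Y")                # "25/12/2020"
format(d2, "%Y-%m-%d")                # "2020-12-25"

# Build a date-time from numeric year, month and day
d3 <- ISOdate(2020, 12, 25)           # time defaults to 12:00 GMT
format(d3, "%Y-%m-%d")                # "2020-12-25"
```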

Missing values and other special values

The missing value symbol is NA . Missing values
often arise in real data, but they can also arise
because of the way calculations are performed.
> some.evens <- NULL # creates a vector with no elements
> some.evens[seq(2, 20, 2)] <- seq(2, 20, 2)
> some.evens
[1] NA 2 NA 4 NA 6 NA 8 NA 10 NA 12 NA 14 NA 16 NA 18 NA 20


What happened here is that we assigned values
to elements 2, 4, ..., 20 but never assigned
anything to elements 1, 3, ..., 19, so R uses NA to
signal that the value is unknown.
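
Missing values propagate through most calculations. As a sketch (an addition to the slides), many summary functions such as sum() and mean() accept the argument na.rm = TRUE to drop missing values before computing.

```r
some.evens <- NULL                    # creates a vector with no elements
some.evens[seq(2, 20, 2)] <- seq(2, 20, 2)
sum(some.evens)                       # NA: any missing value makes the sum unknown
sum(some.evens, na.rm = TRUE)         # 110: missing values are dropped first
mean(some.evens, na.rm = TRUE)        # 11: mean of the ten assigned values
```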
Missing values and other special values

Consider the following:
> x <- c(0, 1, 2)
> x / x
[1] NaN 1 1


The NaN symbol denotes a value which is "not a
number". Here it arises from attempting
to compute the indeterminate form 0/0.

R returns NaN for some calculations that do not
make sense. In other cases,
special values may be shown, or you may get an
error or warning message.
Missing values and other special values

Consider the following:
> x <- c(0, 1, 2)
> 1 / x
[1] Inf 1.0 0.5


Here R has tried to evaluate 1/0 and reports the
infinite result as "Inf".

When there may be missing values, the is.na()
function should be used to detect them. For
instance,
> is.na(some.evens)
 [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
[16] FALSE  TRUE FALSE  TRUE FALSE
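
The special values NA, NaN and Inf can be told apart with is.na(), is.nan() and is.finite(). The sketch below is an addition to the slides; the vector vals is made up for illustration.

```r
vals <- c(NA, 0 / 0, 1 / 0, -1 / 0, 5)   # NA, NaN, Inf, -Inf and an ordinary number
is.na(vals)       # TRUE  TRUE  FALSE FALSE FALSE  (NaN also counts as missing)
is.nan(vals)      # FALSE TRUE  FALSE FALSE FALSE
is.finite(vals)   # FALSE FALSE FALSE FALSE TRUE
```

Note that is.na() is TRUE for both NA and NaN, whereas is.nan() is TRUE only for NaN.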


Packages, libraries, and repositories

In R, a package is a module containing functions,
data, and documentation. Every R installation includes the
base packages (e.g. base , stats , graphics );
these contain functions that nearly everyone uses.

There are also contributed packages (e.g. knitr
and chron); these are modules written by others
to use in R.

When you start your R session, you will have
some packages loaded and available for use,
while others are stored on your computer in a
library.
Check for loaded packages

To be sure a package is loaded, run code like
> library(knitr)
> search()
[1] ".GlobalEnv" "package:knitr" "tools:rstudio" "package:stats"
[5] "package:graphics" "package:grDevices" "package:utils" "package:datasets"
[9] "package:methods" "Autoloads" "package:base"


The search() function returns the names of the
currently attached packages and environments.

The order of the list is the order in which R
searches them when looking up a function name.
Online repositories

Thousands of contributed packages are available,
though you may have only a few dozen installed on your
computer. If you try to use one that isn’t already there,
you will receive an error message:
> library(funStats)
Error in library(funStats) : there is no package called ‘funStats’

This means that the package doesn’t exist on your
computer, but it might be available in a repository online.
The biggest repository of R packages is known as CRAN.
To install a package from CRAN, you can run a command
like
> install.packages("knitr")
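
To avoid reinstalling a package that is already present, one option is to check first with requireNamespace(), which returns TRUE if the package can be found on your computer. This is a sketch, not part of the original slides; the variable pkg is introduced for illustration.

```r
pkg <- "knitr"                        # package name to check
if (requireNamespace(pkg, quietly = TRUE)) {
  message(pkg, " is already installed")
} else {
  # Nothing is installed here; the message only shows the command to run
  message(pkg, " is not installed; run install.packages(\"", pkg, "\")")
}
```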
Loading packages in RStudio


Alternatively, within RStudio, click on the Packages tab in the
Output Pane, choose Install, and enter the package name
in the resulting dialog box.

Getting help


There are many functions in R which are
designed to do all sorts of things.

The online help facility can help you to see what
a particular function is supposed to do. There are
a number of ways of accessing the help facility.
– Type “help()” with the name of the function
between the parentheses, e.g. help(sample).
– Type “?” followed by the function name, e.g. ?sample.
– Press “F1” in RStudio with the cursor on the function name.
Finding help when you do not know the function name

One way to explore the help system is to use help.start() .
This brings up an Internet browser, such as Google
Chrome or Firefox.

The browser will show you a menu of several options,
including a listing of installed packages. (The base
package contains many of the routinely used functions;
other commonly used functions are in utils or stats)

You can get to this page within RStudio by using the Help
| R Help menu item.

Finding help when you do not know the function name

Another function that is often used is help.search() ,
abbreviated as a double question mark. For example, to
see if there are any functions that do optimization
(finding minima or maxima), type
> ??optimization

> help.search("nonlinear programming")


Web search engines such as Google can also be useful
for finding help on R. Including ‘R’ as a keyword in such a
search will often bring up the relevant R help page.

Installing packages

The name of the R package that is needed is usually
listed at the top of the help page. You can usually
install it by typing
> install.packages("packagename")


This will work as long as the package is available in
the CRAN repository.

Google may also find discussions of similar
questions to yours on sites like
https://stackoverflow.com/, where discussions about
R are common.
Some built-in graphics functions

Basic plots such as the histogram, the scatterplot
and the pie chart are built into R. Try the code
below:
> hist(islands)
> x <- seq(1, 10)
> y <- x^2 - 10 * x
> plot(x,y)
> curve(expr = sin, from = 0, to = 6 * pi)
> curve(x^2 - 10 * x, from = 1, to = 10)
> pie.sales <- c(0.12, 0.3, 0.26, 0.16, 0.04, 0.12)
> names(pie.sales) <- c("Blueberry", "Cherry",
+ "Apple", "Boston Cream", "Other", "Vanilla Cream")
> pie(pie.sales) # default colours

Some elementary built-in statistics functions

Basic statistics functions are built into R. The
following is a list of some of them.
median(x) # computes the median or 50th percentile of the data in x
var(x) # computes the variance of the data in x
summary(x) # computes several summary statistics on the data in x
length(x) # number of elements in x
min(x) # minimum value of x
max(x) # maximum value of x
pmin(x, y) # pairwise minima of corresponding elements of x and y
pmax(x, y) # pairwise maxima of x and y
range(x) # minimum and maximum of the data in x, as a two-element vector
IQR(x) # interquartile range: difference between 1st and 3rd
# quartiles of data in x
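
As a quick check of these functions, the sketch below applies them to two small made-up vectors whose results are easy to verify by hand.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(5, 1, 4, 2, 3)
median(x)          # 3
var(x)             # 2.5 (sample variance, dividing by n - 1)
length(x)          # 5
range(x)           # 1 5
diff(range(x))     # 4: the difference between maximum and minimum
IQR(x)             # 2: third quartile (4) minus first quartile (2)
pmin(x, y)         # 1 1 3 2 3
pmax(x, y)         # 5 2 4 4 5
summary(x)         # minimum, quartiles, median, mean and maximum
```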

Lab Assignment #2

The paper airplane factory
