R Lab Manuals - Updated
Group A
Sr. No. | Title of Expt. | Date of Performance | Date of Completion | Sign. of Teacher
1. | Introduction to R | | |
2. | Programming Using R | | |
3. | Lists and Frames | | |
4. | Import and Export Files in R | | |
5. | Mathematical and Statistical Concepts in R | | |
Group B
Group C
Sr. No. | Title of Expt. | Date of Performance | Date of Completion | Sign. of Teacher
1 | For the Iris dataset, visualize data using plot() and also perform the filter(), select(), mutate() and arrange() functions | | |
2 | Write an R program that will identify and remove the missing values from datasets using frequency mean, median or mode options. | | |
3 | Write an R program that will identify outliers and remove outliers from a dataset | | |
4 | Using the lm() function, perform linear regression on the dataset | | |
5 | Write an R script to predict classification of values using decision trees | | |
Group A
Experiment 1:
Date of Performance:
Date of Completion:
Grade:
Signature of Teacher:
Aim: Introduction to R Language
R is a programming language and software environment for statistical analysis, graphics
representation and reporting. R was created by Ross Ihaka and Robert Gentleman at the University
of Auckland, New Zealand, and is currently developed by the R Development Core Team. R is
freely available under the GNU General Public License, and pre-compiled binary versions are
provided for various operating systems such as Linux, Windows and Mac. The language was named R
partly after the first letter of the first names of its two authors (Robert Gentleman and
Ross Ihaka), and partly as a play on the name of the Bell Labs language S.
Features of R
The following are the important features of R −
R is a well-developed, simple and effective programming language which includes
conditionals, loops, user-defined recursive functions and input and output facilities.
R has an effective data handling and storage facility.
R provides a suite of operators for calculations on arrays, lists, vectors and matrices.
R provides a large, coherent and integrated collection of tools for data analysis.
R provides graphical facilities for data analysis and display, either directly on the computer
screen or printed on paper.
How to install RStudio on Windows 10?
Step-by-step process to download and install R and RStudio on the Windows 10 operating system,
and to run an R program in RStudio.
First install the R software (back end)
The Comprehensive R Archive Network (r-project.org)
https://cran.r-project.org
Link 1 : https://cran.r-project.org/bin/window...
Then install the RStudio IDE (front end)
Download the RStudio IDE - RStudio
https://rstudio.com/products/rstudio/download/
Link 2 : https://rstudio.com/products/rstudio/...
Questions 1
1. Go to the R website
Visit: https://cran.r-project.org
2. Click on "Download R for Windows"
3. Select "base" (for the standard R installation)
4. Click on "Download R-x.x.x for Windows"
(Replace x.x.x with the latest version number)
5. Run the downloaded installer
6. Follow the installation wizard:
o Click Next
o Choose installation path (default is fine)
o Select components (default is fine)
o Choose your preferred language
o Finish installation
7. Open R from the Start menu or search bar
1. Free & Open Source – R is completely free to use and open for customization.
2. Statistical Analysis – Built for statistics, data analysis, and mathematical modeling.
3. Data Handling – Efficient in handling and storing large datasets.
4. Data Visualization – Creates high-quality graphs (histograms, pie charts, etc.).
5. Extensive Packages – Thousands of packages for ML, stats, bioinformatics, etc.
6. Cross-Platform – Works on Windows, Mac, and Linux.
7. Active Community – Large community support and documentation.
8. Interpreted Language – No compilation needed; runs line by line.
9. Integration – Can integrate with C, C++, Java, Python, and databases.
10. Reproducible Reports – Use R Markdown and Shiny for reports and web apps.
Experiment 2:
Date of Performance:
Date of Completion:
Grade:
Signature of Teacher:
Aim: Programming Using R
R - Data Types
Generally, while programming in any language, you need variables to store information. Variables
are reserved memory locations that store values: when you create a variable, you reserve some
space in memory.
You may want to store information of various data types such as character, integer, floating
point (numeric), complex or logical (Boolean). In R, the data type of the stored object determines
how much memory is allocated and what can be stored in it.
In contrast to programming languages like C and Java, variables in R are not declared with a
data type. A variable is assigned an R-object, and the data type of the R-object becomes the
data type of the variable. There are many types of R-objects. The frequently used ones are −
Vectors
Lists
Matrices
Arrays
Factors
Data Frames
Vectors
When you want to create a vector with more than one element, use the c() function, which
combines the elements into a vector.
# Create a vector.
apple <- c('red','green',"yellow")
print(apple)
# Get the class of the vector.
print(class(apple))
Lists
A list is an R-object which can contain many different types of elements, such as vectors,
functions and even another list.
# Create a list.
list1 <- list(c(2,5,3),21.3,sin)
# Print the list.
print(list1)
Matrices
A matrix is a two-dimensional rectangular data set. It can be created using a vector input to the
matrix function.
# Create a matrix.
M = matrix( c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
Arrays
While matrices are confined to two dimensions, arrays can have any number of dimensions. The
array function takes a dim attribute which creates the required number of dimensions. In the
example below we create an array with two elements, each of which is a 3x3 matrix.
# Create an array.
a <- array(c('green','yellow'),dim = c(3,3,2))
print(a)
Question: State and explain various data types in R.
Example:
x <- 10
class(x) # "numeric"
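The basic (atomic) data types can be inspected in the same way with class(). The short sketch below is only an illustration; the variable names are chosen for this example and are not part of the manual.

```r
# One value of each basic (atomic) data type in R.
v_logical   <- TRUE                 # logical
v_numeric   <- 23.5                 # numeric (double)
v_integer   <- 2L                   # integer (the L suffix makes it an integer)
v_complex   <- 2 + 5i               # complex
v_character <- "TRUE"               # character (a string, not a logical)
v_raw       <- charToRaw("Hello")   # raw bytes

# class() reports the data type of the object bound to each variable.
print(class(v_logical))
print(class(v_numeric))
print(class(v_integer))
print(class(v_complex))
print(class(v_character))
print(class(v_raw))
```

Because a variable takes the type of the object assigned to it, assigning a different kind of value to the same name simply changes what class() reports.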
Experiment 3:
Date of Performance:
Date of Completion:
Grade:
Signature of Teacher:
Aim: Lists and Frames in R
R - Lists
Lists are R objects which can contain elements of different types, such as numbers, strings,
vectors and even another list. A list can also contain a matrix or a function as an element. A
list is created using the list() function.
Creating a List
The following example creates a list containing a string, a vector, another list, a logical value
and a number (illustrative values).
# Create a list with elements of different types.
list_data <- list("Red", c(21,32,11), list("green",12.3), TRUE, 51.23)
# Access the third element. As it is also a list, all its elements will be printed.
print(list_data[3])
Merging Lists
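You can merge many lists into one list by passing all of them to the c() function. A list can be
converted to a vector with unlist().
# Create lists.
list1 <- list(1:5)
print(list1)
list2 <- list(10:14)
print(list2)
# Merge the two lists.
merged.list <- c(list1,list2)
print(merged.list)
# Convert the lists to vectors.
v1 <- unlist(list1)
v2 <- unlist(list2)
print(v1)
print(v2)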
R - Data Frames
A data frame is a table or a two-dimensional array-like structure in which each column contains
values of one variable and each row contains one set of values from each column.
Following are the characteristics of a data frame.
The column names should be non-empty.
The row names should be unique.
The data stored in a data frame can be of numeric, factor or character type.
Each column should contain same number of data items.
Create Data Frame
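A minimal sketch of creating a data frame with data.frame() follows. The column names and values (emp_id, emp_name, salary) are made up for illustration.

```r
# Create a data frame; every column holds the same number of items.
emp_data <- data.frame(
  emp_id   = 1:3,
  emp_name = c("Rick", "Dan", "Michelle"),
  salary   = c(623.3, 515.2, 611.0),
  stringsAsFactors = FALSE   # keep names as character, not factor
)

# Print the data frame and its structure.
print(emp_data)
str(emp_data)

# Extract one column by name and one row by index.
print(emp_data$salary)
print(emp_data[2, ])
```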
Que: Compare List and Frame in R
Ans.
Feature | List | Data Frame
Definition | A collection of elements of different types | A table-like structure with rows and columns
Structure | 1D, like a container | 2D, like a spreadsheet
Elements | Can contain numbers, vectors, other lists, etc. | Each column is a vector of the same length
Access | By index (x[[1]]) or name (x$name) | By column name (df$col) or index (df[,1])
Row/Col Format | Not arranged in rows and columns | Arranged in rows and columns
Use Case | Flexible structure for mixed data | Used for storing tabular data
Can have names? | Yes, each element can have a name | Yes, columns and rows can have names
Example:
List:
mylist <- list(name = "Harsha", age = 21, scores = c(85, 90, 88))
Data Frame:
mydf <- data.frame(name = c("Harsha", "Ravi"),
age = c(21, 22),
marks = c(90, 85))
Experiment 4:
Date of Performance:
Date of Completion:
Grade:
Signature of Teacher:
Aim: Import and Export Files in R
In R, we can read data from files stored outside the R environment. We can also write data into files
which will be stored and accessed by the operating system. R can read and write various file
formats such as CSV, Excel and XML.
In this experiment we will learn to read data from a CSV file and then write data into a CSV file.
The file should be present in the current working directory so that R can read it. We can also set
our own directory and read files from there.
R - CSV Files
Reading a CSV File
Following is a simple example of read.csv() function to read a CSV file available in your current
working directory −
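As a sketch of how read.csv() is used, the example below first writes a small CSV file so that it can run on its own. The file name input.csv and its contents are assumptions made for illustration.

```r
# Write a small CSV file first so that the example is self-contained.
# The file name and the values are made up for illustration.
csv_lines <- c("id,name,salary",
               "1,Rick,623.3",
               "2,Dan,515.2",
               "3,Michelle,611")
writeLines(csv_lines, "input.csv")

# Read the CSV file from the current working directory into a data frame.
csv_data <- read.csv("input.csv")
print(csv_data)

# read.csv() returns a data frame, so the usual functions apply.
print(is.data.frame(csv_data))
print(ncol(csv_data))
print(nrow(csv_data))
```

getwd() shows the current working directory and setwd() changes it when the file is stored elsewhere.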
R - Excel File
Install xlsx Package
You can use the following command in the R console to install the "xlsx" package. It may ask to
install some additional packages on which this package depends. Use the same command
with the required package name to install them.
install.packages("xlsx")
The input.xlsx is read by using the read.xlsx() function as shown below. The result is stored as a
data frame in the R environment.
# Read the first worksheet in the file input.xlsx.
data <- read.xlsx("input.xlsx", sheetIndex = 1)
print(data)
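Que: Explain how you will import and export the data file in R?
1. Importing Data in R:
CSV File:
data <- read.csv("path/to/your/file.csv")
Excel File:
library(readxl)
data <- read_excel("path/to/your/file.xlsx")
Text File:
data <- read.table("path/to/your/file.txt", header = TRUE)
RData:
load("path/to/your/file.RData")
2. Exporting Data in R:
CSV File:
write.csv(data, "path/to/your/output.csv")
Excel File:
library(writexl)
write_xlsx(data, "path/to/your/output.xlsx")
Text File:
write.table(data, "path/to/your/output.txt")
RData:
save(data, file = "path/to/your/output.RData")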
Experiment 5:
Date of Performance:
Date of Completion:
Grade:
Signature of Teacher:
Aim: Mathematical and Statistical Concepts in R
R - Mean, Median and Mode
Statistical analysis in R is performed by using many built-in functions. Most of these functions are
part of the R base package. These functions take an R vector as input, along with other arguments,
and return the result.
The functions discussed in this experiment are mean, median and mode.
Mean
It is calculated by taking the sum of the values and dividing with the number of values in a data
series.
The function mean() is used to calculate this in R.
Syntax
The basic syntax for calculating mean in R is −
mean(x, trim = 0, na.rm = FALSE, ...)
Following is the description of the parameters used −
x is the input vector.
trim is used to drop some observations from both ends of the sorted vector.
na.rm is used to remove the missing values from the input vector.
Example
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find Mean.
result.mean <- mean(x)
print(result.mean)
Applying Trim Option
When trim parameter is supplied, the values in the vector get sorted and then the required numbers
of observations are dropped from calculating the mean.
When trim = 0.3, 3 values from each end will be dropped from the calculations to find mean.
In this case the sorted vector is (−21, −5, 2, 3, 4.2, 7, 8, 12, 18, 54) and the values removed from the
vector for calculating mean are (−21,−5,2) from left and (12,18,54) from right.
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find Mean.
result.mean <- mean(x,trim = 0.3)
print(result.mean)
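The trimmed mean can be checked by hand. The sketch below drops the three smallest and three largest values itself and compares the result with mean(x, trim = 0.3); the helper names n_drop, kept and manual_mean are introduced only for this check.

```r
# Same vector as in the example above.
x <- c(12, 7, 3, 4.2, 18, 2, 54, -21, 8, -5)

# With trim = 0.3 and 10 values, floor(10 * 0.3) = 3 values are dropped from each end.
n_drop <- floor(length(x) * 0.3)

# Sort the vector and keep only the middle values.
sorted_x <- sort(x)
kept <- sorted_x[(n_drop + 1):(length(x) - n_drop)]
print(kept)

# The mean of the kept values equals the trimmed mean computed by R.
manual_mean <- mean(kept)
print(manual_mean)
print(mean(x, trim = 0.3))
```

The kept values are 3, 4.2, 7 and 8, so both results are (3 + 4.2 + 7 + 8) / 4 = 5.55.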
Median
The middle most value in a data series is called the median. The median() function is used in R to
calculate this value.
Syntax
The basic syntax for calculating median in R is −
median(x, na.rm = FALSE)
Following is the description of the parameters used −
x is the input vector.
na.rm is used to remove the missing values from the input vector.
Example
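A minimal sketch of median(), reusing the vector from the mean example. The vector x_missing with an added NA is an assumption introduced to show the effect of na.rm.

```r
# Create the vector.
x <- c(12, 7, 3, 4.2, 18, 2, 54, -21, 8, -5)

# Find the median. With 10 values it is the average of the 5th and 6th sorted values.
median.result <- median(x)
print(median.result)

# A missing value makes the result NA unless na.rm = TRUE is supplied.
x_missing <- c(x, NA)
print(median(x_missing))
print(median(x_missing, na.rm = TRUE))
```

The sorted vector is (-21, -5, 2, 3, 4.2, 7, 8, 12, 18, 54), so the median is (4.2 + 7) / 2 = 5.6.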
Mode
The mode is the value that has the highest number of occurrences in a set of data. Unlike mean and
median, mode can be computed for both numeric and character data.
R does not have a standard built-in function to calculate the mode. So we create a user function to
calculate the mode of a data set in R. This function takes the vector as input and gives the mode
value as output.
Example
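One possible user function is sketched below. The name getmode and the sample vectors are assumptions for illustration; when several values tie for the highest count, this version returns the one that appears first in the vector.

```r
# User function: return the most frequent value of a vector.
getmode <- function(v) {
  uniqv <- unique(v)                    # distinct values, in order of first appearance
  counts <- tabulate(match(v, uniqv))   # how often each distinct value occurs
  uniqv[which.max(counts)]              # first value with the highest count
}

# Mode of a numeric vector.
v <- c(2, 1, 2, 3, 1, 2, 3, 4, 1, 5, 5, 3, 2, 3)
result <- getmode(v)
print(result)

# Mode of a character vector.
charv <- c("o", "it", "the", "it", "it")
result_char <- getmode(charv)
print(result_char)
```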
R - Pie Charts
The R programming language has numerous libraries to create charts and graphs. A pie chart is a
representation of values as slices of a circle with different colors. The slices are labeled and the
numbers corresponding to each slice are also shown in the chart.
In R the pie chart is created using the pie() function which takes positive numbers as a vector input.
The additional parameters are used to control labels, color, title etc.
Syntax
The basic syntax for creating a pie-chart using the R is −
pie(x, labels, radius, main, col, clockwise)
Following is the description of the parameters used −
x is a vector containing the numeric values used in the pie chart.
labels is used to give description to the slices.
radius indicates the radius of the circle of the pie chart (value between −1 and +1).
main indicates the title of the chart.
col indicates the color palette.
clockwise is a logical value indicating if the slices are drawn clockwise or anti clockwise.
Example
A very simple pie-chart is created using just the input vector and labels. The below script will create
and save the pie chart in the current R working directory.
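A sketch of such a script is given here. The data values, the city labels and the file name city.png are assumptions chosen for illustration.

```r
# Create data for the graph (illustrative values).
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")

# Give the chart file a name.
png(file = "city.png")

# Plot the chart.
pie(x, labels)

# Save the file.
dev.off()
```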
Pie Chart Title and Colors
We can expand the features of the chart by adding more parameters to the function. We will use the
parameter main to add a title to the chart, and the parameter col, which makes use of the
rainbow color palette while drawing the chart. The length of the palette should be the same as the
number of values we have for the chart. Hence we use length(x).
Example
The below script will create and save the pie chart in the current R working directory.
Slice Percentages and Chart Legend
We can add slice percentage and a chart legend by creating additional chart variables.
piepercent<- round(100*x/sum(x), 1)
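Putting the pieces together, the sketch below labels each slice with its percentage, adds a title and rainbow colors, and draws a legend. The data, labels and file name are assumptions for illustration.

```r
# Create data for the graph (illustrative values).
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")

# Percentage of the total for each slice, rounded to one decimal place.
piepercent <- round(100 * x / sum(x), 1)
print(piepercent)

# Give the chart file a name.
png(file = "city_percentage_legends.png")

# Plot the chart with percentages as slice labels, a title and rainbow colors.
pie(x, labels = piepercent, main = "City pie chart", col = rainbow(length(x)))

# Add a legend that maps each color to its city.
legend("topright", labels, cex = 0.8, fill = rainbow(length(x)))

# Save the file.
dev.off()
```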
3D Pie Chart
A pie chart with 3 dimensions can be drawn using additional packages. The package plotrix has a
function called pie3D() that is used for this.
# Get the library.
library(plotrix)
R - Bar Charts
A bar chart represents data in rectangular bars with the length of each bar proportional to the value
of the variable. R uses the function barplot() to create bar charts. R can draw both vertical and
horizontal bars in the bar chart. Each of the bars can be given a different color.
Syntax
The basic syntax to create a bar-chart in R is −
barplot(H,xlab,ylab,main, names.arg,col)
Example
A simple bar chart is created using just the input vector and the name of each bar.
The below script will create and save the bar chart in the current R working directory.
# Create the data for the chart
H <- c(7,12,28,3,41)
M <- c("Mar","Apr","May","Jun","Jul")
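With the data above, a complete script might look like the following sketch. The axis labels, colors, chart title and file name are assumptions for illustration.

```r
# Create the data for the chart.
H <- c(7, 12, 28, 3, 41)
M <- c("Mar", "Apr", "May", "Jun", "Jul")

# Give the chart file a name.
png(file = "barchart_months_revenue.png")

# Plot the bar chart: names.arg puts a month name under each bar.
barplot(H, names.arg = M, xlab = "Month", ylab = "Revenue",
        col = "blue", main = "Revenue chart", border = "red")

# Save the file.
dev.off()
```

Passing horiz = TRUE to barplot() draws the same bars horizontally.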
# Give the chart file a name
png(file = "barchart_stacked.png")
R - Boxplots
Boxplots show how the data in a data set is distributed. A boxplot is drawn from the quartiles of the
data set: it represents the minimum, maximum, median, first quartile and third quartile. It is also
useful for comparing the distribution of data across data sets by drawing a boxplot for each of
them.
Boxplots are created in R by using the boxplot() function.
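The five values that a boxplot is built from can also be computed directly. The sketch below uses the mpg column of the built-in mtcars data set; note that boxplot.stats() reports the whisker ends, which equal the minimum and maximum only when there are no outliers.

```r
# Mileage values from the built-in mtcars data set.
mpg <- mtcars$mpg

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum.
five <- fivenum(mpg)
names(five) <- c("minimum", "lower hinge", "median", "upper hinge", "maximum")
print(five)

# The statistics boxplot() actually draws: whisker ends, hinges and median.
box_stats <- boxplot.stats(mpg)$stats
print(box_stats)
```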
Syntax
The basic syntax to create a boxplot in R is −
boxplot(x, data, notch, varwidth, names, main)
Following is the description of the parameters used −
x is a vector or a formula.
data is the data frame.
notch is a logical value. Set as TRUE to draw a notch.
varwidth is a logical value. Set as TRUE to draw the width of each box proportional to the sample
size.
names are the group labels which will be printed under each boxplot.
main is used to give a title to the graph.
Example
We use the data set "mtcars" available in the R environment to create a basic boxplot. Let's look at
the columns "mpg" and "cyl" in mtcars.
# Give the chart file a name.
png(file = "boxplot.png")
# Plot the chart.
boxplot(mpg ~ cyl, data = mtcars, xlab = "Number of Cylinders", ylab = "Miles Per Gallon", main = "Mileage Data")
# Save the file.
dev.off()
R - Histograms
A frequency distribution in statistics gives the number of occurrences (frequency) of distinct values
within a given period of time or interval, presented as a list, table or graph. Grouped and
ungrouped are the two types of frequency distribution.
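Both kinds of frequency distribution can be tabulated with table(). In the sketch below the sample vectors and the class width of 10 are assumptions for illustration.

```r
# Ungrouped frequency distribution: count each distinct value.
grades <- c("a", "b", "a", "c", "a", "b")
ungrouped <- table(grades)
print(ungrouped)

# Grouped frequency distribution: count values falling into class intervals.
v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)
classes <- cut(v, breaks = seq(0, 50, by = 10))   # intervals (0,10], (10,20], ...
grouped <- table(classes)
print(grouped)
```

hist() draws the grouped distribution as a chart, using the same idea of counting values per interval.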
R - Line Graphs
A line chart is a graph that connects a series of points by drawing line segments between them.
These points are ordered in one of their coordinate (usually the x-coordinate) value. Line charts are
usually used in identifying the trends in data.
The plot() function in R is used to create the line graph.
Syntax
The basic syntax to create a line chart in R is −
plot(v,type,col,xlab,ylab,main)
Following is the description of the parameters used −
v is a vector containing the numeric values.
type takes the value "p" to draw only the points, "l" to draw only the lines and "o" to draw
both points and lines.
xlab is the label for x axis.
ylab is the label for y axis.
main is the Title of the chart.
col is used to give colors to both the points and lines.
Example
A simple line chart is created using the input vector and the type parameter set to "o". The script
below creates and saves a line chart in the current R working directory.
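A sketch of such a script follows. The data values and the file name line_chart.png are assumptions for illustration.

```r
# Create the data for the chart.
v <- c(7, 12, 28, 3, 41)

# Give the chart file a name.
png(file = "line_chart.png")

# Plot the line chart; type = "o" draws both points and lines.
plot(v, type = "o")

# Save the file.
dev.off()
```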
Line Chart Title, Color and Labels
The features of the line chart can be expanded by using additional parameters. We add color to the
points and lines, give a title to the chart and add labels to the axes.
Example
Multiple Lines in a Line Chart
More than one line can be drawn on the same chart by using the lines() function.
After the first line is plotted, the lines() function can use an additional vector as input to draw the
second line in the chart.
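The sketch below combines a title, colors and axis labels with a second line added by lines(). The two data vectors, the labels and the file name are assumptions for illustration.

```r
# Create the data for the chart: one vector per line.
v <- c(7, 12, 28, 3, 41)
w <- c(14, 7, 6, 19, 3)

# Give the chart file a name.
png(file = "line_chart_2_lines.png")

# Plot the first line with a title, a color and axis labels.
plot(v, type = "o", col = "red", xlab = "Month", ylab = "Rain fall",
     main = "Rain fall chart")

# Add the second line to the same chart.
lines(w, type = "o", col = "blue")

# Save the file.
dev.off()
```

lines() only adds to an existing plot, so plot() must be called first; its axis limits are fixed by the first vector.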
R - Scatterplots
Scatterplots show many points plotted in the Cartesian plane. Each point represents the values of
two variables. One variable is chosen in the horizontal axis and another in the vertical axis.
The simple scatterplot is created using the plot() function.
Syntax
The basic syntax for creating scatterplot in R is −
plot(x, y, main, xlab, ylab, xlim, ylim, axes)
Following is the description of the parameters used −
x is the data set whose values are the horizontal coordinates.
y is the data set whose values are the vertical coordinates.
main is the title of the graph.
xlab is the label in the horizontal axis.
ylab is the label in the vertical axis.
xlim is the limits of the values of x used for plotting.
ylim is the limits of the values of y used for plotting.
axes indicates whether both axes should be drawn on the plot.
Example
We use the data set "mtcars" available in the R environment to create a basic scatterplot. Let's use the
columns "wt" and "mpg" in mtcars.
# Get the input values.
input <- mtcars[,c('wt','mpg')]
# Plot the chart for cars with weight between 2.5 and 5 and mileage between 15 and 30.
plot(x = input$wt,y = input$mpg,
   xlab = "Weight",
   ylab = "Mileage",
   xlim = c(2.5,5),
   ylim = c(15,30),
   main = "Weight vs Mileage"
)
Scatterplot Matrices
When we have more than two variables and we want to find the correlation between one variable
versus the remaining ones we use scatterplot matrix. We use pairs() function to create matrices of
scatterplots.
Syntax
The basic syntax for creating scatterplot matrices in R is −
pairs(formula, data)
Following is the description of the parameters used −
formula represents the series of variables used in pairs.
data represents the data set from which the variables will be taken.
Example
Each variable is paired up with each of the remaining variables. A scatterplot is plotted for each pair.
# Give the chart file a name.
png(file = "scatterplot_matrices.png")
# Plot the matrices between 4 variables, giving 12 plots.
pairs(~wt+mpg+disp+cyl,data = mtcars,
   main = "Scatterplot Matrix")
# Save the file.
dev.off()
Que: What is a pie chart, bar graph, line graph and scatter plot?
Ans.
1. Pie Chart
A pie chart is a circular chart divided into slices to illustrate numerical proportions. Each slice
of the pie represents a category's contribution to the whole, usually expressed as a percentage.
The entire circle represents 100%.
2. Bar Graph
A bar graph (or bar chart) is a visual representation of data using rectangular bars. The length or
height of each bar is proportional to the value it represents. Bar graphs are useful for comparing
different categories or groups.
3. Line Graph
A line graph displays data points connected by straight lines. It is commonly used to show
trends over time, such as changes in temperature, sales, or population. The x-axis usually
represents time, and the y-axis shows the values.
4. Scatter Plot
A scatter plot is a graph that shows individual data points plotted on a two-dimensional
coordinate system. Each point represents a pair of values. Scatter plots are useful for identifying
relationships or correlations between two variables.
Group B
Experiment 1:
Date of Performance:
Date of Completion:
Grade:
Signature of Teacher:
Aim: Write an R program that swaps any two numbers without using any third
number.
Algorithm
STEP 1: START.
STEP 2: ENTER x, y.
STEP 3: PRINT x, y.
STEP 4: x = x + y.
STEP 5: y = x - y.
STEP 6: x = x - y.
STEP 7: PRINT x, y.
STEP 8: END.
Script in R
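# Expt 1: Swapping two numbers without a third number
x <- as.integer(readline(prompt = "Enter x value :"))
y <- as.integer(readline(prompt = "Enter y value :"))
# Print the values before swapping.
print(c(x, y))
x <- x + y
y <- x - y
x <- x - y
# Print the values after swapping.
print(c(x, y))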
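readline() only waits for input in an interactive session, so for a quick check the same algorithm can be wrapped in a function. The function name swap_without_third and the sample values are assumptions for illustration.

```r
# Swap two numbers using only addition and subtraction.
swap_without_third <- function(x, y) {
  x <- x + y   # x now holds the sum of both values
  y <- x - y   # sum minus the old y gives the old x
  x <- x - y   # sum minus the old x gives the old y
  c(x = x, y = y)
}

# Swap 5 and 9.
result <- swap_without_third(5, 9)
print(result)
```

With very large integers the intermediate sum can overflow, so numeric (double) values are the safer choice for this trick.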
Experiment 2:
Date of Performance:
Date of Completion:
Grade:
Signature of Teacher:
Aim: Write an R script using for, while and repeat loops that prints the
value of i from 1 to 10.
R - Loops
There may be a situation when you need to execute a block of code several times. In
general, statements are executed sequentially: the first statement in a function is executed first,
followed by the second, and so on.
Programming languages provide various control structures that allow for more complicated
execution paths.
A loop statement allows us to execute a statement or group of statements multiple times.
R provides the following kinds of loop to handle looping requirements.
1 repeat loop
Executes a sequence of statements again and again until a break statement inside the loop body
stops it.
2 while loop
Repeats a statement or group of statements while a given condition is true. It tests the
condition before executing the loop body.
3 for loop
Executes the loop body once for each element of a vector, which abbreviates the code that manages
the loop variable.
The repeat loop executes the same code again and again until a stop condition is met.
Syntax
The basic syntax for creating a repeat loop in R is −
repeat {
commands
if(condition) {
break
}
}
The while loop executes the same code again and again as long as the test condition is true.
Syntax
The basic syntax for creating a while loop in R is −
while (test_expression) {
statement
}
A For loop is a repetition control structure that allows you to efficiently write a loop that needs to
execute a specific number of times.
Syntax
The basic syntax for creating a for loop statement in R is −
for (value in vector) {
statements
}
Loop Control Statements
Loop control statements change execution from its normal sequence.
R supports the following control statements.
1 break statement
Terminates the loop statement and transfers execution to the statement immediately
following the loop.
2 next statement
Skips the remainder of the current iteration of a loop and starts the next iteration.
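A short sketch of both control statements in one loop. The variable name odd_values and the limits used are assumptions for illustration.

```r
# Collect the odd numbers up to 7 from the sequence 1 to 10.
odd_values <- c()
for (i in 1:10) {
  if (i %% 2 == 0) {
    next    # skip even numbers and continue with the next i
  }
  if (i > 7) {
    break   # leave the loop completely once i exceeds 7
  }
  odd_values <- c(odd_values, i)
}
print(odd_values)
```

The loop stops at i = 9, so only 1, 3, 5 and 7 are collected.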
Script in R
print("Use of For Loop")
for (i in 1:10) {
   print(i)
}
print("Use of While Loop")
i <- 1
while (i < 11) {
   print(i)
   i <- i + 1
}
print("Use of Repeat Loop")
i <- 1
repeat {
   print(i)
   i <- i + 1
   # Stop once 10 has been printed.
   if (i > 10) {
      break
   }
}
Experiment 3:
Date of Performance:
Date of Completion:
Grade:
Signature of Teacher:
Aim: Write an R script to find the factorial of any given number using a
recursive function
R - Functions
Function Definition
An R function is created by using the keyword function. The basic syntax of an R function
definition is as follows −
function_name <- function(arg_1, arg_2, ...) {
   Function body
}
User-defined Function
We can create user-defined functions in R. They are specific to what a user wants and once created
they can be used like the built-in functions. Below is an example of how a function is created and
used.
# Create a function to print squares of numbers in sequence.
new.function <- function(a) {
   for(i in 1:a) {
      b <- i^2
      print(b)
   }
}
Calling a Function
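# Create a function to print squares of numbers in sequence.
new.function <- function(a) {
   for(i in 1:a) {
      b <- i^2
      print(b)
   }
}
# Call the function new.function supplying 9 as an argument.
new.function(9)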
Script in R
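# Recursive function to find the factorial of a number
recur_factorial <- function(n) {
   if(n <= 1) {
      return(1)
   } else {
      return(n * recur_factorial(n-1))
   }
}
# Read the number and print its factorial.
n <- as.integer(readline(prompt = "Enter a number :"))
a <- recur_factorial(n)
print(a)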
Experiment 4:
Date of Performance:
Date of Completion:
Grade:
Signature of Teacher:
Aim: Write an R program that reads the CSV file. Find the maximum and
minimum values among all three.
R - CSV Files
In R, we can read data from files stored outside the R environment. We can also write data into files
which will be stored and accessed by the operating system. R can read and write various file
formats such as CSV, Excel and XML.
In this experiment we will learn to read data from a CSV file and then write data into a CSV file.
The file should be present in the current working directory so that R can read it. We can also set
our own directory and read files from there.
Script R
# Load the data
tit<-read.csv("train.csv", header=TRUE)
View(tit)
# Add a readable survival label
tit$SurvivedLabel<-ifelse(tit$Survived==1, "Survived","Died")
View(tit)
# Family size = the passenger + siblings/spouses + parents/children
tit$FamilySize<-1+tit$SibSp+tit$Parch
View(tit)
str(tit)
# Apply a row filter to the Titanic data frame - will return only males
males<-tit[tit$Sex=="male",]
View(males)
# Apply a row filter to the Titanic data frame - will return only females
females<-tit[tit$Sex=="female",]
View(females)
# summary() reports the minimum and maximum fare along with the quartiles and the mean
summary(males$Fare)
var(males$Fare)
sd(males$Fare)
sum(males$Fare)
length(males$Fare)
#install.packages("ggplot2")
#library(ggplot2)
View(iris)
write.csv(tit, 'ssp.csv')
write.csv(iris,'irisdata.csv')
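The aim asks for the maximum and minimum values, so a compact sketch that does only that is given here. It assumes a CSV file with three numeric columns; the file name numbers.csv, the column names a, b, c and the values are made up so that the example runs on its own.

```r
# Write a small CSV file with three numeric columns (illustrative data).
writeLines(c("a,b,c",
             "12,45,7",
             "30,2,19",
             "8,27,51"), "numbers.csv")

# Read the CSV file into a data frame.
nums <- read.csv("numbers.csv")
print(nums)

# Maximum and minimum of each column.
col_max <- sapply(nums, max)
col_min <- sapply(nums, min)
print(col_max)
print(col_min)

# Maximum and minimum over all three columns together.
overall_max <- max(unlist(nums))
overall_min <- min(unlist(nums))
print(overall_max)
print(overall_min)
```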
Experiment 5:
Date of Performance:
Date of Completion:
Grade:
Signature of Teacher:
Aim: Using the various built-in functions, plot a pie chart, scatter plot, histogram and
line chart
In R the pie chart is created using the pie() function which takes positive numbers as a vector input.
The additional parameters are used to control labels, color, title etc.
Syntax
The basic syntax for creating a pie-chart using the R is −
pie(x, labels, radius, main, col, clockwise)
Following is the description of the parameters used −
x is a vector containing the numeric values used in the pie chart.
labels is used to give description to the slices.
radius indicates the radius of the circle of the pie chart (value between −1 and +1).
main indicates the title of the chart.
col indicates the color palette.
clockwise is a logical value indicating if the slices are drawn clockwise or anti clockwise.
Example
A very simple pie-chart is created using just the input vector and labels. The below script will create
and save the pie chart in the current R working directory.
# Plot the chart.
pie(x,labels)
Example
The below script will create and save the pie chart in the current R working directory.
piepercent<- round(100*x/sum(x), 1)
3D Pie Chart
A pie chart with 3 dimensions can be drawn using additional packages. The package plotrix has a
function called pie3D() that is used for this.
# Get the library.
library(plotrix)
R - Scatterplots
Scatterplots show many points plotted in the Cartesian plane. Each point represents the values of
two variables. One variable is chosen in the horizontal axis and another in the vertical axis.
The simple scatterplot is created using the plot() function.
Syntax
The basic syntax for creating scatterplot in R is −
plot(x, y, main, xlab, ylab, xlim, ylim, axes)
Following is the description of the parameters used −
x is the data set whose values are the horizontal coordinates.
y is the data set whose values are the vertical coordinates.
main is the title of the graph.
xlab is the label in the horizontal axis.
ylab is the label in the vertical axis.
xlim is the limits of the values of x used for plotting.
ylim is the limits of the values of y used for plotting.
axes indicates whether both axes should be drawn on the plot.
Example
We use the data set "mtcars" available in the R environment to create a basic scatterplot. Let's use the
columns "wt" and "mpg" in mtcars.
# Get the input values.
input <- mtcars[,c('wt','mpg')]
# Plot the chart for cars with weight between 2.5 and 5 and mileage between 15 and 30.
plot(x = input$wt,y = input$mpg,
   xlab = "Weight",
   ylab = "Mileage",
   xlim = c(2.5,5),
   ylim = c(15,30),
   main = "Weight vs Mileage"
)
Scatterplot Matrices
When we have more than two variables and we want to find the correlation between one variable
versus the remaining ones we use scatterplot matrix. We use pairs() function to create matrices of
scatterplots.
Syntax
The basic syntax for creating scatterplot matrices in R is −
pairs(formula, data)
Following is the description of the parameters used −
formula represents the series of variables used in pairs.
data represents the data set from which the variables will be taken.
Example
Each variable is paired up with each of the remaining variables. A scatterplot is plotted for each pair.
pairs(~wt+mpg+disp+cyl,data = mtcars,
main = "Scatterplot Matrix")
R - Histograms
A histogram represents the frequencies of values of a variable bucketed into ranges. A histogram is
similar to a bar chart, but the difference is that it groups the values into continuous ranges. The
height of each bar in a histogram represents the number of values present in that range.
R creates a histogram using the hist() function. This function takes a vector as input and uses some
more parameters to plot histograms.
Syntax
hist(v,main,xlab,xlim,ylim,breaks,col,border)
Example
A simple histogram is created using input vector, label, col and border parameters.
The script given below will create and save the histogram in the current R working directory.
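A sketch of such a script follows. The data values, axis label, colors and file name are assumptions for illustration.

```r
# Create data for the graph (illustrative values).
v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)

# Give the chart file a name.
png(file = "histogram.png")

# Create the histogram; hist() also returns the breaks and counts it used.
h <- hist(v, xlab = "Weight", col = "yellow", border = "blue")

# Save the file.
dev.off()

# Number of values that fell into each range.
print(h$breaks)
print(h$counts)
```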
R - Line Graphs
A line chart is a graph that connects a series of points by drawing line segments between them.
These points are ordered in one of their coordinate (usually the x-coordinate) value. Line charts are
usually used in identifying the trends in data.
Syntax
plot(v,type,col,xlab,ylab,main)
Following is the description of the parameters used −
v is a vector containing the numeric values.
type takes the value "p" to draw only the points, "l" to draw only the lines and "o" to draw
both points and lines.
xlab is the label for x axis.
ylab is the label for y axis.
main is the Title of the chart.
col is used to give colors to both the points and lines.
Example
A simple line chart is created using the input vector and the type parameter set to "o". The script
below plots a line chart.
# Create the data for the chart.
v <- c(7,12,28,3,41)
# Plot the line chart with both points and lines.
plot(v,type = "o")
Group C
Experiment 1:
Date of Performance:
Date of Completion:
Grade:
Signature of Teacher:
Aim: For the Iris dataset, visualize the data using plot() and also perform the filter(), select(),
mutate() and arrange() functions
filter() allows you to subset observations based on their values. The first argument is the name of the
data frame. The second and subsequent arguments are the expressions that filter the data frame.
select() keeps only the columns that you name.
mutate() adds new columns that are computed from existing columns.
arrange() orders the rows of a data frame by the values of selected columns.
library(dplyr)
View(iris)
View(df)
#Visualization
plot(iris)
plot(iris$sepal.width, iris$sepal.length)
63
hist(iris$sepal.width)
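# filter()
names(iris)
# Keep only the rows of the species virginica.
virginica <- filter(iris, Species == "virginica")
head(virginica)
# We can also filter for multiple conditions within our function.
sepalLength6 <- filter(iris, Species == "virginica", Sepal.Length > 6)
tail(sepalLength6)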
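# select()
# This function selects data by column name. You can select any number of columns
# in a few different ways.
selected <- select(iris, Sepal.Length, Sepal.Width, Petal.Length)
# The same columns can be selected as a range.
selected2 <- select(iris, Sepal.Length:Petal.Length)
head(selected, 3)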
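# mutate()
# Create a new column that stores logical values for Sepal.Width greater than half of
# Sepal.Length.
newCol <- mutate(iris, greater.half = Sepal.Width > 0.5 * Sepal.Length)
tail(newCol)
# arrange()
# Order the rows by Petal.Width (the sort column is chosen for illustration).
newCol <- arrange(newCol, Petal.Width)
head(newCol)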
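The four verbs can also be chained into one pipeline with the pipe operator %>%, which dplyr makes available. The following is a sketch, not part of the original script: the chosen columns, the threshold of 6 and the name result are assumptions made for illustration, and the dplyr package must already be installed.

```r
library(dplyr)

result <- iris %>%
  # filter(): keep virginica flowers with a sepal longer than 6 cm.
  filter(Species == "virginica", Sepal.Length > 6) %>%
  # select(): keep three columns by name.
  select(Species, Sepal.Length, Sepal.Width) %>%
  # mutate(): add a logical column computed from two existing columns.
  mutate(greater.half = Sepal.Width > 0.5 * Sepal.Length) %>%
  # arrange(): sort by sepal length, largest first.
  arrange(desc(Sepal.Length))

head(result, 3)
```

Each step receives the data frame produced by the previous step, so no intermediate variables such as virginica or newCol are needed.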
Experiment 2:
Date of Performance:
Date of Completion:
Grade:
Signature of Teacher:
Aim: Write an R program that will identify and remove the missing values from
datasets using frequency mean, median or mode options.
Missing values in data science arise when an observation is missing in a column of a data
frame or contains a character value instead of a numeric value. Missing values must be
dropped or replaced in order to draw correct conclusions from the data.
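# Identify and remove the missing values from datasets using frequency mean, median or
# mode options.
a <- read.csv("123.csv")
View(a)
# mean() returns NA when the column contains missing values, unless na.rm = TRUE is set.
mean(a$age)
mean(a$age, na.rm = TRUE)
# The complete.cases function detects rows in a data.frame that do not contain any
# missing value.
complete.cases(a)
# na.omit() removes the incomplete records.
b <- na.omit(a)
View(b)
# Helper (assumed definition): flags non-finite values in numeric columns and NA elsewhere.
is.special <- function(x) {
  if (is.numeric(x)) !is.finite(x) else is.na(x)
}
sapply(a, is.special)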
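The aim also mentions replacing missing values by the mean, median or mode. A minimal sketch of that step follows, using base R only. The data frame, its values and the helper stat_mode() are hypothetical and introduced for illustration; they stand in for the contents of 123.csv, which are not shown in this manual.

```r
# Illustrative data frame (hypothetical values) with missing entries.
a <- data.frame(
  age  = c(21, 25, NA, 30, 25, NA, 40),
  city = c("Pune", NA, "Mumbai", "Pune", "Pune", "Nashik", NA),
  stringsAsFactors = FALSE
)

# Count the missing values in each column.
na_count <- colSums(is.na(a))
print(na_count)

# Helper (not a base R function): most frequent non-missing value.
stat_mode <- function(x) {
  ux <- unique(x[!is.na(x)])
  ux[which.max(tabulate(match(x, ux)))]
}

# Replace missing ages by the mean or by the median of the observed ages.
age_mean   <- ifelse(is.na(a$age), mean(a$age, na.rm = TRUE), a$age)
age_median <- ifelse(is.na(a$age), median(a$age, na.rm = TRUE), a$age)

# Replace missing cities by the mode (the most frequent city).
city_mode <- ifelse(is.na(a$city), stat_mode(a$city), a$city)

print(age_mean)
print(age_median)
print(city_mode)
```

The mean and median apply only to numeric columns, while the mode also works for character columns, which is why the sketch treats age and city differently.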
Experiment 3:
Date of Performance:
Date of Completion:
Grade:
Signature of Teacher:
Aim: Write an R program that will identify outliers and remove outliers from
dataset
An outlier is a value or an observation that is distant from other observations, that is to say, a
data point that differs significantly from other data points. Enderlein (1987) goes even further and
considers outliers to be values that deviate so much from other observations that one might suppose a
different underlying sampling mechanism.
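# R program that will identify outliers and remove outliers from dataset
# x is the numeric vector (for example, one column of the dataset) to be checked.
out <- boxplot.stats(x)$out      # values flagged as outliers by the boxplot rule
out_ind <- which(x %in% out)     # positions of the outliers in x
out_ind
x_clean <- x[!(x %in% out)]      # x with the outliers removed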
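A complete run on a small vector may make the steps easier to follow. The values below are hypothetical and chosen so that one observation lies far from the others; boxplot.stats() flags values that lie more than 1.5 times the interquartile range beyond the hinges.

```r
# Illustrative measurements (hypothetical values); 95 lies far from the rest.
x <- c(10, 12, 11, 13, 12, 11, 95)

# Values flagged by the boxplot rule.
out <- boxplot.stats(x)$out

# Positions of those values in x.
out_ind <- which(x %in% out)

# Keep only the observations that were not flagged.
x_clean <- x[!(x %in% out)]

print(out)      # 95
print(out_ind)  # 7
print(x_clean)
```

For a data frame, the same logical condition can be used to drop whole rows, for example dat[!(dat$x %in% out), ], where dat is a hypothetical data frame with a column x.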
Experiment 4:
Date of Performance:
Date of Completion:
Grade:
Signature of Teacher:
R - Linear Regression
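Regression analysis is a very widely used statistical tool to establish a relationship model between
two variables. One of these variables is called the predictor variable, whose value is gathered through
experiments. The other variable is called the response variable, whose value is derived from the
predictor variable.
The general mathematical equation for a linear regression is
y = ax + b
where y is the response variable, x is the predictor variable, and a and b are constants called the
coefficients.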
Find the coefficients from the model created and create the mathematical equation using these.
Get a summary of the relationship model to know the average error in prediction, also called residuals.
To predict the weight of new persons, use the predict() function in R.
lm() Function
This function creates the relationship model between the predictor and the response variable.
Syntax
The basic syntax for lm() function in linear regression is −
lm(formula,data)
Following is the description of the parameters used −
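formula is a symbolic description of the relation between x and y, for example y ~ x.
data is the data frame containing the variables named in the formula. It can be omitted when x and y
already exist as vectors in the workspace.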
Create Relationship Model & Get the Coefficients
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
# Apply the lm() function.
relation <- lm(y ~ x)
print(relation)
When we execute the above code, it produces the following result −
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
-38.4551 0.6746
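The step "create the mathematical equation" can be sketched as follows. coef() is the standard extractor for model coefficients; the variable names a, b and coefs are chosen here to match the form y = ax + b and are not part of the original script.

```r
# Height (predictor) and weight (response) values from the example.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Fit the relationship model.
relation <- lm(y ~ x)

# coef() returns a named vector with the entries "(Intercept)" and "x".
coefs <- coef(relation)
b <- coefs[["(Intercept)"]]  # intercept
a <- coefs[["x"]]            # slope

# Write the fitted line in the form y = ax + b.
cat(sprintf("y = %.4fx + (%.4f)\n", a, b))
```

With the coefficients printed above, this gives the line y = 0.6746x + (-38.4551).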
Get the Summary of the Relationship
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
# Apply the lm() function.
relation <- lm(y ~ x)
print(summary(relation))
When we execute the above code, it produces the following result −
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-6.3002 -1.6629 0.0412 1.8944 3.9775
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.45509 8.04901 -4.778 0.00139 **
x 0.67461 0.05191 12.997 1.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
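Individual parts of the summary can also be read programmatically instead of from the printed text. The sketch below uses residuals() and the components of the summary object; the variable names res, s, r2 and p_x are illustrative.

```r
# Height (predictor) and weight (response) values from the example.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)

# Residuals: observed weight minus fitted weight for every person.
res <- residuals(relation)
print(range(res))  # smallest and largest residual, as in the Residuals block

# The summary object stores its pieces by name.
s <- summary(relation)
r2  <- s$r.squared                        # share of variance explained by x
p_x <- s$coefficients["x", "Pr(>|t|)"]    # p-value of the slope

print(r2)
print(p_x)
```

The smallest and largest residuals correspond to the Min and Max columns of the printed summary, and p_x is the value shown in the Pr(>|t|) column for x.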
predict() Function
Syntax
The basic syntax for predict() in linear regression is −
predict(object, newdata)
Following is the description of the parameters used −
object is the model which is already created using the lm() function.
newdata is the data frame containing the new values for the predictor variable.
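One detail that is easy to miss is that the column in newdata must carry the same name as the predictor in the formula (here x). A short sketch with several new heights follows; the heights 160, 170 and 180 and the name new_heights are illustrative assumptions.

```r
# Height (predictor) and weight (response) values from the example.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)

# New heights; the column must be called x, like the predictor in y ~ x.
new_heights <- data.frame(x = c(160, 170, 180))

# One predicted weight per row of new_heights.
result <- predict(relation, newdata = new_heights)
print(result)
```

Each prediction is the fitted line evaluated at the new height, so taller persons receive larger predicted weights because the slope is positive.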
Predict the weight of new persons
Script in R
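# Values of height.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
# Values of weight.
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
# Apply the lm() function.
relation <- lm(y ~ x)
print(relation)
print(summary(relation))
# predict() function: find the weight of a person with height 170 (illustrative value).
a <- data.frame(x = 170)
result <- predict(relation, a)
print(result)
# Plot the data with the fitted line and save the chart as a PNG file.
png(file = "linearregression.png")
plot(x, y, col = "blue", pch = 16, main = "Height & Weight Regression",
     xlab = "Height in cm", ylab = "Weight in Kg")
abline(relation)
dev.off()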
Experiment 5:
Date of Performance:
Date of Completion:
Grade:
Signature of Teacher:
R - Decision Tree
A decision tree is a graph that represents choices and their results in the form of a tree. The nodes in
the graph represent an event or choice, and the edges of the graph represent the decision rules or
conditions. It is mostly used in Machine Learning and Data Mining applications using R.
Install R Package
Use the command below in the R console to install the package. You also have to install the dependent
packages, if any.
install.packages("party")
The package "party" has the function ctree(), which is used to create and analyze decision trees.
Syntax
The basic syntax for creating a decision tree in R is −
ctree(formula, data)
Following is the description of the parameters used −
formula is a formula describing the predictor and response variables.
data is the name of the data set used.
Input Data
We will use the data set named readingSkills, which is included in the party package, to create a
decision tree. It records the variables "age", "shoeSize" and "score" for each person, together with
"nativeSpeaker", which states whether the person is a native speaker or not.
Here is the sample data.
# Load the party package. It will automatically load other
# dependent packages.
library(party)
# Print some records from data set readingSkills.
print(head(readingSkills))
When we execute the above code, it produces the following result −
nativeSpeaker age shoeSize score
1 yes 5 24.83189 32.29385
2 yes 6 25.95238 36.63105
3 no 11 30.42170 49.60593
4 yes 7 28.66450 40.28456
5 yes 11 31.88207 55.46085
6 yes 10 30.07843 52.83124
Loading required package: methods
Loading required package: grid
...............................
...............................
Example
We will use the ctree() function to create the decision tree and see its graph.
# Load the party package. It will automatically load other
# dependent packages.
library(party)
Script in R
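# The package "party" has the function ctree(), which is used to create and analyze
# decision trees.
# install.packages("party")
# Load the party package. It will automatically load other dependent packages.
library(party)
print(head(readingSkills))
# We will use the ctree() function to create the decision tree and see its graph.
# Print the full data set.
readingSkills
# Export the data and read it back; text columns are kept as factors for ctree().
write.csv(readingSkills, "readingskills.csv", row.names = FALSE)
data1 <- read.csv("readingskills.csv", stringsAsFactors = TRUE)
# Create the input data frame.
input.dat <- data1
# Give the chart file a name.
png(file = "decision_tree.png")
# ctree(formula, data)
output.tree <- ctree(nativeSpeaker ~ age + shoeSize + score,
                     data = input.dat)
plot(output.tree)
dev.off()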
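The aim of the experiment is to predict a classification, so a fitted tree would normally be applied to rows it has not seen. The sketch below is one possible way to do this and is not part of the original script: the split into the first 105 rows for training and the remaining rows for testing is an arbitrary assumption, as are the names train, test, pred and conf_mat. The party package must already be installed.

```r
library(party)

# Arbitrary split: first 105 rows for training, the rest for testing.
train <- readingSkills[1:105, ]
test  <- readingSkills[106:nrow(readingSkills), ]

# Fit the tree on the training rows only.
output.tree <- ctree(nativeSpeaker ~ age + shoeSize + score, data = train)

# Predict the class (yes / no) for the unseen rows.
pred <- predict(output.tree, newdata = test)

# Confusion matrix: predicted class against actual class.
conf_mat <- table(predicted = pred, actual = test$nativeSpeaker)
print(conf_mat)

# Share of test rows that were classified correctly.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(accuracy)
```

The diagonal of the confusion matrix counts the correct predictions, so the accuracy is the diagonal sum divided by the number of test rows.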
**********