An Introduction to R

Dr. Sabyasachi Patra

2025-08-01

Dr. Sabyasachi Patra An Introduction to R 2025-08-01 1 / 38


What is R?

R is a powerful open-source programming language and software
environment for statistical computing, data analysis, and graphics.

It was created by Ross Ihaka and Robert Gentleman at the University
of Auckland, New Zealand.
R is an implementation of the S programming language.



Why Use R for Data Mining?

Comprehensive: Extensive collection of tools for data manipulation, analysis, and modeling.
Open-Source & Free: No license fees. Anyone can use, modify, and share it.
Visualization: State-of-the-art graphics capabilities (e.g., with the ggplot2 package).
Vibrant Community: A massive, active community provides support through forums (Stack Overflow) and blogs (R-Bloggers).
Extensible: Users can easily create and share packages. The CRAN repository has over 18,000 packages.



R vs. Other Tools
Feature           R                              Python
Primary Use       Statistical Analysis           General Purpose, AI
Data Libraries    dplyr, data.table              pandas, numpy
Visualization     ggplot2 (declarative)          matplotlib (imperative)
Learning Curve    Steeper for non-programmers    Generally easier syntax

Feature           R                         Excel
Data Size         Handles large datasets    Limited by row/column count
Reproducibility   Excellent (via scripts)   Poor (manual steps)
Advanced Stats    Built-in, extensive       Requires add-ins, limited
Automation        Strong                    Limited (VBA)



Installation Steps

1 Install R First
    Go to the Comprehensive R Archive Network (CRAN).
    Download R: select your operating system (Windows, Mac, or Linux) and download the latest version.
2 Then, Install RStudio
    Go to the RStudio website.
    Download RStudio Desktop: choose the free Open Source License version.



A Tour of the RStudio IDE

RStudio is typically divided into four panes:

Source Editor (Top-Left): Where you write and save your R scripts (.R).
Console (Bottom-Left): Where you can type and execute R code directly. Output appears here.
Environment/History (Top-Right):
    Environment: Shows all the objects (variables, data frames) in your current session.
    History: A record of all the commands you’ve run.
Files/Plots/Packages/Help (Bottom-Right):
    Files: Browse files on your computer.
    Plots: Displays any graphs you create.
    Packages: Manage your installed packages.
    Help: View documentation for functions.



Working Directory

Your working directory is the default location where R will look for files
you want to import and where it will save files you want to export.
It is crucial to manage your working directory for reproducible
projects!

getwd(): Get (see) the current working directory.


setwd("path/to/your/directory"): Set the working directory.



Working Directory: Example

# See my current working directory


getwd()

# Set my working directory to a project folder on my Desktop


# Note: Windows users might need to use double backslashes \\
# setwd("C:\\Users\\YourUser\\Desktop\\R_Project")
setwd("/Users/YourUser/Desktop/R_Project")

# Now if I run getwd() again, it will show the new path


getwd()



The R Console & Basic Arithmetic

The R console is where you type commands and see results.


Basic Arithmetic Operators:

+ (addition), - (subtraction), * (multiplication), / (division)


^ or ** (exponentiation)
%% (modulus - remainder from division)
%/% (integer division)



Basic Arithmetic: Examples
# Addition and Subtraction
5 + 3

[1] 8

10 - 4

[1] 6

# Multiplication and Division


6 * 7

[1] 42

100 / 4

[1] 25



Exponentiation, Modulus, and Integer Division: Examples
# Exponentiation
2^3 # 2 to the power of 3

[1] 8

5**2 # 5 squared

[1] 25

# Modulus (remainder)
10 %% 3 # Remainder of 10 divided by 3 is 1

[1] 1

# Integer Division
10 %/% 3 # How many times does 3 fully go into 10?

[1] 3
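
The two operators above are linked: multiplying the integer quotient by the divisor and adding the remainder gives back the original number. A minimal sketch, using made-up values a and b:

```r
# Hypothetical values chosen for illustration
a <- 17
b <- 5

quotient  <- a %/% b   # whole number of times b fits into a
remainder <- a %% b    # what is left over

# quotient * b + remainder reconstructs a
reconstructed <- quotient * b + remainder
print(c(quotient, remainder, reconstructed))

# A common use of %%: checking whether a number is even
is_even <- (a %% 2) == 0
print(is_even)
```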
What is a Variable?

A variable is a name used to store a value or an R object. Once you store
a value in a variable, you can use the variable’s name to refer to that value
later.
Assignment Operators

<- (The “gets” operator): This is the most common and preferred
assignment operator in R.
= (Equals sign): Also works for assignment, but is typically reserved
for setting arguments in functions.

Convention: Use <- for all variable assignments.
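
One way to see why = is usually kept for function arguments is sketched below. The variable names (avg, avg2, values) are illustrative only.

```r
# Inside a function call, = names an argument; here it feeds mean()'s x argument
avg <- mean(x = c(2, 4, 6))
print(avg)

# Using <- inside a call creates a variable as a side effect,
# which is rarely what you want
avg2 <- mean(values <- c(2, 4, 6))
print(values)
```

Both calls compute the same mean, but only the second leaves a new object (values) behind in the workspace.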



Variable Assignment: Examples
# Assign the value 42 to the variable 'x'
x <- 42
print(x)

[1] 42
# The value of 'x' can be used in calculations
y <- x / 2
print(y)

[1] 21
# Assign a character string
course_title <- "Data Mining with R"
print(course_title)

[1] "Data Mining with R"


# Re-assigning a new value to an existing variable
x <- 100
print(x)

[1] 100
Variable Naming Rules & Conventions

Rules (Must be followed):

Can contain letters, numbers, dots (.), and underscores (_).


Must start with a letter or a dot.
If it starts with a dot, it cannot be followed by a number.
R is case-sensitive! myVariable is different from myvariable.

Conventions (Good practice):

Use descriptive names: student_age is better than x.


Be consistent. Choose one style and stick to it.
customer_id, student_age (very common in R)
customerId, studentAge
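
A quick way to test the naming rules is base R's make.names(), which rewrites a string into a valid name. If the string comes back unchanged, it was already valid. The helper is_valid_name() and the candidate names below are hypothetical, written only for this sketch.

```r
# Hypothetical helper: TRUE when a string is already a syntactically valid name
is_valid_name <- function(candidate) {
  candidate == make.names(candidate)
}

candidates <- c("student_age", ".hidden", "2fast", ".2way", "_temp", "my var")
validity <- is_valid_name(candidates)

# Show each candidate next to the verdict
print(data.frame(candidates, validity))
```

Only the first two candidates pass: the others start with a digit, a dot followed by a digit, or an underscore, or contain a space.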



Listing and Removing Variables
ls(): Lists all objects in your current workspace (environment).
rm(): Removes an object.

# Create some variables


var1 <- 10
var2 <- "hello"
var3 <- TRUE

# List all variables


ls()

[1] "course_title" "has_annotations" "var1" "var2"


[5] "var3" "x" "y"

# Remove a specific variable


rm(var2)

# See what's left


ls()

[1] "course_title" "has_annotations" "var1" "var3"


[5] "x" "y"



Special Values in R

NA: Not Available. Represents missing or undefined values.


NaN: Not a Number. Represents an impossible mathematical
operation (e.g., 0/0).
Inf and -Inf: Positive and negative infinity. Result from operations
like 1/0.
NULL: Represents the absence of anything. It’s different from NA.
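
The differences between these values are easiest to see by testing them. A short sketch with illustrative variable names:

```r
missing_value <- NA
not_a_number  <- 0 / 0
pos_infinity  <- 1 / 0
neg_infinity  <- -1 / 0

is.na(missing_value)     # TRUE
is.nan(not_a_number)     # TRUE
is.na(not_a_number)      # TRUE: NaN also counts as missing
is.nan(missing_value)    # FALSE: NA is not NaN
is.finite(pos_infinity)  # FALSE

# NULL has length 0, while NA is a placeholder of length 1
length(NULL)             # 0
length(NA)               # 1
c(1, NA, 3)              # NA keeps its slot: length 3
c(1, NULL, 3)            # NULL disappears: length 2
```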



Overview of Data Structures

R has several fundamental data structures. They differ based on their
dimensionality and whether they can hold elements of mixed types.

Dimensions   Homogeneous (Same Type)   Heterogeneous (Mixed Types)
1D           Atomic Vector             List
2D           Matrix                    Data Frame
nD           Array                     (No direct nD equivalent)

We will also cover Factors, a special vector for categorical data.



Atomic Vectors: The Building Blocks

The most basic data structure in R.


There are four main types you’ll use:

numeric (or double): The default for numbers, e.g., 10.5, 55.
integer: Numbers without decimals. Must be specified with an L,
e.g., 10L.
character: Text strings, e.g., "hello".
logical: TRUE or FALSE.
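
The types above can be checked with typeof(). The sketch below also shows what happens when types are mixed in one vector: R silently converts everything to the most flexible type. The names mixed_num and mixed_chr are illustrative.

```r
typeof(10.5)     # "double"
typeof(55)       # "double": whole numbers are still double without the L
typeof(10L)      # "integer"
typeof("hello")  # "character"
typeof(TRUE)     # "logical"

# Mixing types in c() coerces all elements to a single type
mixed_num <- c(1L, 2.5, TRUE)    # TRUE becomes 1
mixed_chr <- c(1, "two", TRUE)   # everything becomes character
print(mixed_num)
print(mixed_chr)
```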



Creating Vectors with c()
The c() function (for combine/concatenate) is the primary way to create vectors.

# A numeric vector
sales_figures <- c(150.5, 200, 99.99)
class(sales_figures) # Check the type

[1] "numeric"

# A character vector
regions <- c("North", "South", "East", "West")
class(regions)

[1] "character"

# A logical vector
is_profitable <- c(TRUE, TRUE, FALSE, TRUE)
class(is_profitable)

[1] "logical"



Vector Indexing: By Position
Access elements in a vector using square brackets []. R uses 1-based indexing.

grades <- c(88, 92, 75, 95, 68)

# Get the third element


grades[3]

[1] 75

# Get the first and fourth elements


grades[c(1, 4)]

[1] 88 95

# Get a slice of elements from 2 to 4


grades[2:4]

[1] 92 75 95



Vector Indexing: By Logic
A logical vector of the same length selects the elements where the value is TRUE. A negative index, shown first for comparison, instead drops the element at that position.

grades <- c(88, 92, 75, 95, 68)

# Get all elements EXCEPT the third one


grades[-3]

[1] 88 92 95 68

# Get all grades greater than 90


grades[grades > 90]

[1] 92 95

# Get all grades that are not 75


grades[grades != 75]

[1] 88 92 95 68
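
It can help to look at the logical vector itself before using it as an index. A small sketch that also combines two conditions and uses which() to recover positions (the variable names are illustrative):

```r
grades <- c(88, 92, 75, 95, 68)

# The comparison on its own produces a logical vector
above_80 <- grades > 80
print(above_80)

# Combine conditions with & (and) or | (or)
mid_range <- grades[grades >= 75 & grades < 90]
print(mid_range)

# which() converts the TRUE positions into integer indices
top_positions <- which(grades > 90)
print(top_positions)
```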



Vectorized Operations
A key feature of R is that operations are naturally “vectorized.” This means an operation
is applied element-by-element, making code fast and concise (no for loops needed).

a <- c(1, 2, 3)
b <- c(10, 20, 30)

# Element-wise addition
a + b

[1] 11 22 33

# Element-wise multiplication
a * b

[1] 10 40 90

# Vector recycling: shorter vector is repeated


a + c(10, 20) # a is c(1,2,3), so c(10,20) becomes c(10,20,10); R also warns that the lengths do not divide evenly

[1] 11 22 13
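
To make the "no for loops needed" point concrete, the sketch below computes the same result twice: once with an explicit loop and once vectorized. The prices and tax rate are made-up numbers.

```r
prices   <- c(100, 250, 40)   # hypothetical prices
tax_rate <- 0.18              # hypothetical tax rate

# Loop version: fill a result vector one element at a time
with_tax_loop <- numeric(length(prices))
for (i in seq_along(prices)) {
  with_tax_loop[i] <- prices[i] * (1 + tax_rate)
}

# Vectorized version: one expression, the single number is recycled
with_tax_vec <- prices * (1 + tax_rate)

# Both approaches give the same answer
all.equal(with_tax_loop, with_tax_vec)
```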
Matrices
A matrix is a 2-dimensional collection of elements of the same type. It has rows and
columns. Use the matrix() function to create one.

# Create a matrix with 6 elements, 2 rows, 3 columns


# R fills matrices by column by default
m1 <- matrix(1:6, nrow = 2, ncol = 3)
print(m1)

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

# Create a matrix and fill it by row


m2 <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
print(m2)

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
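
Matrices are indexed with [row, column], and they support both element-wise and true matrix arithmetic. A brief sketch using the column-filled matrix from above:

```r
m1 <- matrix(1:6, nrow = 2, ncol = 3)

dim(m1)        # 2 3
m1[2, 3]       # row 2, column 3 -> 6
m1[, 2]        # all of column 2 -> 3 4
m1[1, ]        # all of row 1 -> 1 3 5

t(m1)          # transpose: a 3 x 2 matrix
m1 * 2         # element-wise multiplication
m1 %*% t(m1)   # matrix multiplication: a 2 x 2 result
```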



Lists
A list is a very flexible 1D object. It’s an ordered collection of elements that can be of
different types.

# A list containing different types of objects


student_info <- list(
name = "Aisha",
student_id = 9876,
courses_taken = c("Stats", "R Programming", "Finance"),
grades_matrix = matrix(c(85, 90, 88, 92), nrow = 2)
)

str(student_info) # Use str() to see the structure

List of 4
$ name : chr "Aisha"
$ student_id : num 9876
$ courses_taken: chr [1:3] "Stats" "R Programming" "Finance"
$ grades_matrix: num [1:2, 1:2] 85 90 88 92



List Indexing: [[...]] vs $ vs [...]
[[...]] or $: Extracts a single element from the list. The result is the element
itself.
[...]: Extracts a sub-list. The result is always another list.

print(student_info$name) # Using $

[1] "Aisha"

print(student_info[["name"]]) # Using [[...]] by name

[1] "Aisha"

# Compare the class of the output


class(student_info[["name"]]) # Extracts the character vector

[1] "character"

class(student_info["name"]) # Extracts a list containing the vector

[1] "list"
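
Extraction can be chained, and the same $ syntax adds or removes elements. The sketch below uses a trimmed-down copy of the student list so it runs on its own. The graduated field is a made-up addition.

```r
student_info <- list(
  name = "Aisha",
  courses_taken = c("Stats", "R Programming", "Finance")
)

# Chain extraction: pull the vector out, then index into it
second_course <- student_info$courses_taken[2]
print(second_course)

# Add a new element by assigning to a new name
student_info$graduated <- FALSE

# Remove an element by assigning NULL to it
student_info$name <- NULL

names(student_info)
```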
Factors
A factor is a special vector used to store categorical data. R stores them efficiently as
integers with corresponding character labels (“levels”).
# A character vector of education levels
education_data <- c("Bachelors", "Masters", "PhD", "Masters", "Bachelors")

# Convert to a factor
education_factor <- factor(education_data)
print(education_factor)

[1] Bachelors Masters PhD Masters Bachelors


Levels: Bachelors Masters PhD
# Ordered factors capture a natural order
satisfaction <- factor(c("High", "Low", "Medium"),
levels = c("Low", "Medium", "High"),
ordered = TRUE)
print(satisfaction)

[1] High Low Medium


Levels: Low < Medium < High
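
The integer storage and the level order can be inspected directly. A minimal sketch that rebuilds the two factors above so it is self-contained:

```r
education_factor <- factor(c("Bachelors", "Masters", "PhD", "Masters", "Bachelors"))

levels(education_factor)       # the distinct labels
as.integer(education_factor)   # the underlying integer codes: 1 2 3 2 1
table(education_factor)        # counts per level

satisfaction <- factor(c("High", "Low", "Medium"),
                       levels = c("Low", "Medium", "High"),
                       ordered = TRUE)

# Ordered factors support comparisons that respect the level order
satisfaction > "Low"           # TRUE FALSE TRUE
```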
Data Frames

The data frame is the most important data structure for data analysis in R. It’s a 2D
table where columns can have different types. Think of it as an Excel spreadsheet.

# Create the data frame


products_df <- data.frame(
product_id = c(1, 2, 3),
product_name = c("Laptop", "Mouse", "Keyboard"),
price = c(75000, 1200, 2500),
in_stock = c(TRUE, TRUE, FALSE)
)
print(products_df)

  product_id product_name price in_stock
1          1       Laptop 75000     TRUE
2          2        Mouse  1200     TRUE
3          3     Keyboard  2500    FALSE
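
Because each column is a vector, a new column can be computed from existing ones in a single vectorized step. A sketch that rebuilds the same data frame; the 10% discount is a made-up figure.

```r
products_df <- data.frame(
  product_id = c(1, 2, 3),
  product_name = c("Laptop", "Mouse", "Keyboard"),
  price = c(75000, 1200, 2500),
  in_stock = c(TRUE, TRUE, FALSE)
)

nrow(products_df)   # 3
ncol(products_df)   # 4

# Add a derived column (hypothetical 10% discount)
products_df$sale_price <- products_df$price * 0.9

# Column names after the addition
names(products_df)
```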



Exploring Data Frames

Essential functions for quickly understanding a data frame:

str(): Display the structure of the data frame (Most useful first
look!).
head(): Show the first 6 rows (the default; pass n to change it).
tail(): Show the last 6 rows (the default; pass n to change it).
summary(): Provide a statistical summary of each column.



str() and summary() in Action
# Use the built-in iris dataset for a better example
str(iris)

'data.frame': 150 obs. of 5 variables:


$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1

summary(iris[,1:4])

Sepal.Length Sepal.Width Petal.Length Petal.Width


Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
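
head() and tail() work the same way on iris. A short sketch that also checks the overall size and the balance of the species column:

```r
head(iris, 3)         # first 3 rows instead of the default 6
tail(iris, 2)         # last 2 rows
dim(iris)             # 150 rows, 5 columns
table(iris$Species)   # 50 flowers per species
```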
Data Frame Indexing: Selecting Columns
You can select columns in three main ways, similar to lists.

# Using $ (best for interactive use)


products_df$product_name

[1] "Laptop" "Mouse" "Keyboard"

# Using [...] (returns another data frame)


products_df["price"]

price
1 75000
2 1200
3 2500

# Select multiple columns


products_df[c("product_name", "price")]

product_name price
1 Laptop 75000
2 Mouse 1200
3 Keyboard 2500



Data Frame Indexing: Selecting Rows & Columns
Combine row and column selection using [row, column] notation to get specific values.

# Get rows 1 and 3


products_df[c(1, 3), ]

  product_id product_name price in_stock
1          1       Laptop 75000     TRUE
3          3     Keyboard  2500    FALSE

# Get all products with a price > 2000


products_df[products_df$price > 2000, ]

  product_id product_name price in_stock
1          1       Laptop 75000     TRUE
3          3     Keyboard  2500    FALSE

# Get the name and stock status for products cheaper than 3000
products_df[products_df$price < 3000, c("product_name", "in_stock")]

product_name in_stock
2 Mouse TRUE
3 Keyboard FALSE
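
Two related base R tools are worth knowing alongside bracket indexing: order() for sorting rows, and subset() as a more readable way to filter. A sketch on the same data, rebuilt here so it runs on its own (by_price and cheap are illustrative names):

```r
products_df <- data.frame(
  product_id = c(1, 2, 3),
  product_name = c("Laptop", "Mouse", "Keyboard"),
  price = c(75000, 1200, 2500),
  in_stock = c(TRUE, TRUE, FALSE)
)

# order() returns the row positions that sort the column; use it as a row index
by_price <- products_df[order(products_df$price), ]
print(by_price)

# subset() lets you refer to columns by bare name
cheap <- subset(products_df, price < 3000, select = c(product_name, in_stock))
print(cheap)
```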



Importing Data: read.csv()

The most common task is reading data from a CSV file.

# This code assumes 'student_grades.csv' is in your working directory

# A modern, robust way to read CSVs


grades_data <- read.csv(
"student_grades.csv",
header = TRUE, # First row contains column names
stringsAsFactors = FALSE, # Keep text as characters, not factors
na.strings = c("NA", "Missing", "") # Define what counts as NA
)

head(grades_data)
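
To try the call without having student_grades.csv on disk, the sketch below first writes a small, made-up CSV to a temporary file and then reads it back with the same arguments. The student names and scores are invented for illustration.

```r
# Write a tiny made-up CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "student,score,grade",
  "Asha,88,A",
  "Ravi,Missing,B",
  "Meera,,A"
), csv_path)

grades_data <- read.csv(
  csv_path,
  header = TRUE,
  stringsAsFactors = FALSE,
  na.strings = c("NA", "Missing", "")
)

str(grades_data)

# Both "Missing" and the empty field were read as NA
sum(is.na(grades_data$score))
```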



R Packages

Packages are collections of functions, data, and documentation that
extend R’s capabilities.
The workflow is simple:
1 Install: Download the package from CRAN to your computer. You
only do this once.
    install.packages("package_name")
2 Load: Load the package into your current R session to make its
functions available. You do this every time you start a new R session.
    library(package_name)
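
A common pattern, sketched here as one possible approach, is to install a package only when it is missing and then load it. The example uses "stats", which ships with R, so nothing is downloaded. Swapping in a CRAN package name such as "ggplot2" would trigger the install step if needed.

```r
pkg <- "stats"  # ships with R; replace with a CRAN package name as needed

if (!requireNamespace(pkg, quietly = TRUE)) {
  # Only reached when the package is missing; needs an internet connection
  install.packages(pkg)
}

# character.only = TRUE lets library() accept a name stored in a variable
library(pkg, character.only = TRUE)

# Confirm the package is attached to the current session
paste0("package:", pkg) %in% search()
```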



Essential Packages for Data Mining

The tidyverse: An opinionated collection of R packages designed
for data science.
    dplyr: Data manipulation (filter, arrange, mutate, etc.).
    ggplot2: World-class data visualization.
    readr: Fast reading of rectangular data (like CSVs).
data.table: An alternative to data frames for very fast aggregation
of large data.
caret: A comprehensive framework for building machine learning
models.



Statistical Summary Functions

These are some of the most fundamental functions for data analysis. They work on
numeric vectors.

test_scores <- c(88, 92, 95, 75, 68, 88, 92, 100, NA)

Mean: mean(test_scores, na.rm = TRUE) -> 87.25
Median: median(test_scores, na.rm = TRUE) -> 90
Standard Deviation: sd(test_scores, na.rm = TRUE) -> 10.62
Variance: var(test_scores, na.rm = TRUE) -> 112.79
Sum: sum(test_scores, na.rm = TRUE) -> 698
Min/Max: min(test_scores, na.rm = TRUE) / max(test_scores, na.rm = TRUE) -> 68 / 100
Range: range(test_scores, na.rm = TRUE) -> 68, 100

Note: These functions return NA if the vector contains any missing value, so use na.rm = TRUE to ignore missing (NA) values.
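
The effect of na.rm is easiest to see side by side. A short sketch on the same scores:

```r
test_scores <- c(88, 92, 95, 75, 68, 88, 92, 100, NA)

mean(test_scores)                 # NA: the missing value propagates
mean(test_scores, na.rm = TRUE)   # 87.25

# Count the missing values before deciding how to handle them
sum(is.na(test_scores))           # 1

# Quartiles of the observed scores
quantile(test_scores, na.rm = TRUE)
```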



Character (String) Functions
Functions for working with text data.

my_string <- "Data Mining with R"

# Number of characters
nchar(my_string)

[1] 18

# Convert to upper/lower case


toupper(my_string)

[1] "DATA MINING WITH R"

tolower(my_string)

[1] "data mining with r"


# Paste strings together


paste("Course", "Code", "OQM401", sep = "-")

[1] "Course-Code-OQM401"

paste0("Product", "_", 101) # Paste with no separator

[1] "Product_101"

# Replace substrings
gsub("Data", "Business", my_string)

[1] "Business Mining with R"
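
A few more base R string functions follow the same pattern. This sketch extracts, splits, searches, and formats the same string (words is an illustrative name):

```r
my_string <- "Data Mining with R"

# Extract characters 1 through 4
substr(my_string, 1, 4)                  # "Data"

# strsplit() returns a list; [[1]] takes the vector for our single string
words <- strsplit(my_string, " ")[[1]]
length(words)                            # 4

# Test whether a pattern occurs in the string
grepl("Mining", my_string)               # TRUE

# Build a formatted message: %s for text, %d for an integer
sprintf("%s has %d characters", my_string, nchar(my_string))
```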



Control Flow: if-else & ifelse()
if-else: Executes code blocks based on a single condition.
ifelse(): A vectorized version that works on an entire vector.

# Standard if-else
score <- 45
if (score >= 50) {
"Pass"
} else {
"Fail"
}

[1] "Fail"

# Vectorized ifelse()
scores <- c(88, 45, 92, 65)
results <- ifelse(scores >= 50, "Pass", "Fail")
print(results)

[1] "Pass" "Fail" "Pass" "Pass"
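
When there are more than two outcomes, ifelse() calls can be nested, and if-else can be extended with else if. The "Distinction" band and its cutoff of 85 are assumptions added for this sketch.

```r
scores <- c(88, 45, 92, 65)

# Nested ifelse(): conditions are checked from the highest band down
grade_band <- ifelse(scores >= 85, "Distinction",
              ifelse(scores >= 50, "Pass", "Fail"))
print(grade_band)

# if / else if / else for a single value
score <- 65
if (score >= 85) {
  result <- "Distinction"
} else if (score >= 50) {
  result <- "Pass"
} else {
  result <- "Fail"
}
print(result)
```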
