Unit 03 Notes
SNEHAL K JOSHI
‘R – Programming’
Data analysis is used across various fields, including business, healthcare, social sciences, engineering, and more, to
make informed decisions, optimize processes, and gain insights into complex phenomena.
UNIT-3 : NOTES BY : DR. SNEHAL K JOSHI
1. Descriptive Statistics: Summarizing data through measures such as mean, median, mode, variance,
standard deviation, and correlation. This provides a basic understanding of the data's distribution and
central tendency.
2. Inferential Statistics: Making predictions or inferences about a population based on a sample of data.
Techniques include hypothesis testing, confidence intervals, regression analysis, and ANOVA (Analysis of
Variance).
3. Probability Theory: Assessing the likelihood of events occurring within a given data set. This includes the
use of probability distributions, such as the normal distribution, to model data.
4. Correlation and Causation: Analyzing the strength and direction of relationships between variables and
determining whether one variable causes changes in another.
5. Statistical Modeling: Building models to explain the relationships between variables, such as linear
regression models for predicting outcomes.
6. Hypothesis Testing: Evaluating assumptions or claims about a population by analyzing sample data. This
involves tests like t-tests, chi-square tests, and p-values to determine the statistical significance of results.
Statistical data analysis is crucial for making evidence-based decisions and for understanding the underlying
patterns and relationships in data.
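As a quick illustration of the descriptive measures listed above, base R's `stats` functions can be applied to a small sample vector (the data here is made up for illustration):

```r
marks <- c(78, 85, 90, 95, 82, 88, 76, 91)

mean(marks)    # arithmetic mean
median(marks)  # middle value
var(marks)     # variance
sd(marks)      # standard deviation

# Correlation between two variables
hours <- c(5, 6, 8, 9, 5, 7, 4, 8)
cor(marks, hours)  # Pearson correlation coefficient
```

The same functions work column-wise on data frames, which is how they are typically used in the examples later in these notes.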
R’s packages like `stats`, `MASS`, `ggplot2`, and `lme4` offer tools for virtually every type of statistical
analysis.
3. Data Visualization:
R excels in data visualization, with packages like `ggplot2`, `plotly`, and `lattice` providing tools to create
complex and publication-quality plots.
Visualizations like histograms, scatter plots, bar charts, box plots, and heatmaps help in exploring data
trends and relationships.
4. Reproducibility:
R scripts and notebooks (e.g., R Markdown) allow analysts to document their analysis process, making it
easier to reproduce and share results.
This is crucial in research and business environments where consistent and transparent analysis is
required.
6. Extensibility:
R has a vast ecosystem of packages (over 18,000 on CRAN) that extend its capabilities in data analysis.
Whether you need specialized statistical tests, machine learning models, or bioinformatics tools, R has a
package for it.
The community-driven nature of R ensures continuous updates and the availability of cutting-edge
techniques.
Example:
Consider analyzing the `marks.csv` dataset using R. You could start by importing the data, cleaning it (e.g.,
handling missing values), performing descriptive statistics (e.g., finding the mean and standard deviation of
students' marks), and visualizing the distribution of marks using histograms. Then, you could build a linear
regression model to predict student performance based on their marks in different subjects. All of this can
be done efficiently using R’s extensive statistical and data manipulation capabilities.
In summary, R is a versatile and powerful tool for both data analysis and statistical analysis, offering a
comprehensive suite of tools for data manipulation, statistical modeling, and data visualization.
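A minimal sketch of that workflow, assuming column names such as `subject_1` and `total` (taken from the marks dataset used later in these notes):

```r
# Import
marks_df <- read.csv("marks.csv")

# Clean: drop rows with missing values
marks_df <- na.omit(marks_df)

# Describe
mean(marks_df$total)
sd(marks_df$total)

# Visualize the distribution of marks
hist(marks_df$total, main = "Distribution of Total Marks", xlab = "Total Marks")

# Model: predict total marks from one subject's marks
model <- lm(total ~ subject_1, data = marks_df)
summary(model)
```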
1. Comments
Use `#` to add comments in your code. Anything after `#` on a line will be ignored by R.
```r
# This is a comment
```
2. Variables and Assignment
Assign values to variables using the `<-` operator or the `=` operator.
```r
x <- 10 # Assign 10 to x
y = 5   # Assign 5 to y
```
3. Basic Data Types
Numeric: Represents numbers (e.g., integers, floating-point numbers).
```r
num <- 42 # Numeric
```
Character: Represents text strings.
```r
str <- "Hello" # Character
```
Vectors are a basic data structure in R, which hold elements of the same type.
Create a vector using the `c()` function.
```r
vec <- c(1, 2, 3, 4, 5)                # Numeric vector
names <- c("Alice", "Bob", "Charlie")  # Character vector
```
5. Operations on Vectors
```r
vec2 <- vec * 2        # Multiplies each element by 2
sum_vec <- vec + vec2  # Adds corresponding elements
```
6. Sequences and Repetitions
Create sequences using `:` or the `seq()` function.
```r
seq1 <- 1:10             # Sequence from 1 to 10
seq2 <- seq(1, 10, by=2) # Sequence from 1 to 10 with step 2
```
Repeat elements using the `rep()` function.
```r
rep_vec <- rep(1:3, times=3) # Repeats 1, 2, 3 three times
```
7. Matrices
Matrices are two-dimensional arrays that contain elements of the same type.
```r
mat <- matrix(1:9, nrow=3, ncol=3) # 3x3 matrix
```
8. Lists
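Unlike vectors, lists can hold elements of different types. They are created with the `list()` function; a minimal sketch:

```r
# A list mixing a character, a numeric vector, and a logical value
my_list <- list(name = "Alice", scores = c(90, 85, 92), passed = TRUE)

my_list$name   # Access an element by name
my_list[[2]]   # Access an element by position (returns the scores vector)
```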
9. Data Frames
Data frames are used to store tabular data, with rows representing observations and columns representing
variables.
```r
df <- data.frame(
  name = c("Rutvik", "Pankti", "Evaan"),
  age = c(24, 19, 8),
  score = c(90, 85, 92)
)
```
10. Accessing Elements
Use brackets `[]` to access elements in vectors, matrices, and data frames.
```r
vec[1] # First element of vector
mat[1,2] # Element in the first row, second column of matrix
df$name # Access the 'name' column of the data frame
df[1, "age"] # Access the 'age' value of the first row
```
11. Conditional Statements
Use `if`, `else if`, and `else` to control the flow of your program.
```r
x <- 10
if (x > 0) {
  print("Positive")
} else if (x == 0) {
  print("Zero")
} else {
  print("Negative")
}
```
12. Loops
Use `for` and `while` loops for repetitive tasks.
```r
# for loop
for (i in 1:5) {
  print(i)
}

# while loop
count <- 1
while (count <= 5) {
  print(count)
  count <- count + 1
}
```
13. Functions
Define functions using the `function` keyword.
```r
add <- function(a, b) {
  return(a + b)
}
result <- add(3, 4) # Calls the function with arguments 3 and 4
```
14. Packages
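Packages extend R with additional functions and datasets. The general pattern, illustrated later in these notes with `readxl` and `dplyr`, is to install a package once with `install.packages()` and load it in each session with `library()`:

```r
# Install once (downloads the package from CRAN)
install.packages("dplyr")

# Load the package in the current session
library(dplyr)
```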
Examples
1. Basic Use
Output:
```
[1] (18,36.3] (36.3,54.7] (18,36.3] (54.7,73] (18,36.3] (36.3,54.7] (18,36.3] (36.3,54.7]
Levels: (18,36.3] (36.3,54.7] (54.7,73]
```
Here, `cut` divides the data into 3 intervals and assigns each age to one of these intervals.
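The call that produces output of this shape looks like the following; the `ages` vector below is hypothetical (the original input values are not shown in these notes), chosen to span 18 to 73 so the interval labels match:

```r
ages <- c(25, 40, 30, 73, 18, 45, 22, 50)  # hypothetical ages

age_groups <- cut(ages, breaks = 3)  # divide the range into 3 equal-width intervals
print(age_groups)
```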
2. Specifying Breakpoints
Output:
```
[1] Young Adult Child Senior Young Adult Young Adult Child Senior
Levels: Child Young Adult Senior
```
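A call matching this output would pass explicit breakpoints together with labels; the exact breakpoints below are assumptions, and `ages` is a vector of ages as in the previous example:

```r
age_groups <- cut(ages, breaks = c(0, 18, 35, 100),
                  labels = c("Child", "Young Adult", "Senior"))
print(age_groups)
```

With explicit breaks, each value is assigned the label of the interval it falls into, so the factor levels follow the order of the `labels` argument.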
5. Closed Intervals
By default, the intervals are closed on the right, meaning that the interval `(a,b]` includes `b` but not `a`. To
close the interval on the left instead, set `right = FALSE`:
```r
age_groups <- cut(ages, breaks=c(18, 35, 50, 100), right=FALSE)
print(age_groups)
```
Use Case
The `cut` function is often used in data preprocessing to transform continuous variables into categorical
variables.
For example, if you're working with age data and want to categorize it into age groups (e.g., Child, Teen,
Adult), `cut` is a convenient way to achieve this.
Python Code:
```python
import pandas as pd

data = {
    "student_name": [
        "John Doe", "Jane Smith", "Michael Brown", "Emily Davis", "Chris Johnson",
        "Sarah Lee", "Matthew Taylor", "Jessica Wilson", "David White", "Laura Moore",
        "James Harris", "Sophia Thompson", "Daniel Martin", "Olivia Clark", "Henry Lewis",
        "Emma Walker", "Liam Hall", "Ava Young", "Mason Allen", "Isabella King"
    ],
    "subject_1": [78, 85, 90, 95, 82, 88, 76, 91, 84, 87, 89, 93, 81, 77, 92, 86, 79, 80, 83, 94],
    "subject_2": [88, 79, 92, 81, 94, 85, 87, 90, 77, 82, 91, 84, 80, 89, 83, 86, 93, 78, 92, 88],
    "subject_3": [90, 88, 85, 92, 81, 87, 89, 84, 93, 91, 86, 79, 88, 80, 77, 82, 94, 92, 79, 83],
}
df = pd.DataFrame(data)

df["total"] = df["subject_1"] + df["subject_2"] + df["subject_3"]
df["percentage"] = df["total"] / 3
df["class"] = pd.cut(df["percentage"], bins=[0, 50, 60, 70, 80, 90, 100], labels=["F", "D", "C", "B", "A", "A+"])

file_path = "/mnt/data/student_records.csv"
df.to_csv(file_path, index=False)
```
R Code:
```r
# Create data (the student_name vector is truncated in the source; it holds 20 names)
student_id <- paste0("S", 1:20)
student_name <- c("Mitesh", "Payal")  # truncated in source: the full vector has 20 names
subject_1 <- c(78, 85, 90, 95, 82, 88, 76, 91, 84, 87, 89, 93, 81, 77, 92, 86, 79, 80, 83, 94)
subject_2 <- c(88, 79, 92, 81, 94, 85, 87, 90, 77, 82, 91, 84, 80, 89, 83, 86, 93, 78, 92, 88)
subject_3 <- c(90, 88, 85, 92, 81, 87, 89, 84, 93, 91, 86, 79, 88, 80, 77, 82, 94, 92, 79, 83)

total <- subject_1 + subject_2 + subject_3
percentage <- total / 3
class <- cut(percentage, breaks=c(0, 50, 60, 70, 80, 90, 100),
             labels=c("F", "D", "C", "B", "A", "A+"))

students_df <- data.frame(student_id, student_name, subject_1, subject_2, subject_3, total, percentage, class)

# Write to CSV
write.csv(students_df, "student_records.csv", row.names = FALSE)
```
Q. Write R code to read records from a CSV file that exists in a folder on my
desktop and display the records one by one.
Answer :
```r
# Define the file path
file_path <- "/content/marks.csv"

# Read the CSV file
students_df <- read.csv(file_path)

# Display the records one by one
for (i in 1:nrow(students_df)) {
  print(students_df[i, ])
}
```
Q. Write R code to append a new record to an existing CSV file in a folder on my
desktop and display the total number of records available in the CSV file after
adding the new record.
Answer :
```r
# Define the file path
file_path <- "/content/marks.csv"

# Read the existing records
students_df <- read.csv(file_path)

# Build the new record
new_record <- data.frame(
  student_id = "S21",
  student_name = "Lily Scott",
  subject_1 = 85,
  subject_2 = 90,
  subject_3 = 88,
  total = 85 + 90 + 88,
  percentage = (85 + 90 + 88) / 3,
  class = cut((85 + 90 + 88) / 3, breaks=c(0, 50, 60, 70, 80, 90, 100), labels=c("F", "D", "C", "B", "A", "A+"))
)

# Append the new record and write it back to the CSV file
students_df <- rbind(students_df, new_record)
write.csv(students_df, file_path, row.names = FALSE)

cat("Total number of records in the CSV file after adding the new record:", nrow(students_df), "\n")
```
These steps help ensure that the dataset you're working with is accurate, consistent, and relevant, which ultimately
leads to better analysis and more reliable results.
Example:
Suppose we have a dataset of customer orders from an e-commerce platform. The dataset includes the following
columns:
```r
# Load dataset in R
orders <- read.csv("orders.csv")
```
```r
# View the first few rows of the dataset
head(orders)
# Check the structure of the dataset
str(orders)
# Summary statistics for numerical columns
summary(orders)
```
```r
# Check for missing values
sum(is.na(orders))
```
Remove Rows/Columns: If missing values are few and scattered, you can remove the affected rows or columns.
Impute Missing Values: Replace missing values with the mean, median, mode, or other relevant values.
```r
# Remove rows with missing values
orders_clean <- na.omit(orders)

# Impute missing values in 'Quantity' with the median
orders$Quantity[is.na(orders$Quantity)] <- median(orders$Quantity, na.rm = TRUE)
```
Step 5: Correcting Data Types
Ensure that each column has the correct data type (e.g., numeric, factor, date).
```r
# Convert 'OrderDate' to Date type
orders$OrderDate <- as.Date(orders$OrderDate, format="%Y-%m-%d")
```
```r
# Check for duplicates based on all columns
sum(duplicated(orders))
# Remove duplicate rows
orders_clean <- orders[!duplicated(orders), ]
```
Step 7: Filtering Data
Filtering involves selecting a subset of data that is relevant for your analysis.
Example 1:
Filter out canceled orders since they may not be relevant to the analysis.
```r
# Filter to include only completed and pending orders
orders_clean <- subset(orders_clean, Status != "Canceled")
```
Example 2:
```r
# Filter orders from 2023 only
orders_2023 <- subset(orders_clean, OrderDate >= "2023-01-01" & OrderDate <= "2023-12-31")
```
Step 8: Correcting Data Inconsistencies
Look for inconsistencies, such as incorrect or misspelled entries, and correct them.
```r
# Check for inconsistent entries in 'Status'
table(orders_clean$Status)
```
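Once `table()` reveals an inconsistent spelling, it can be corrected by reassignment; the value `"canceled"` below is a hypothetical misspelling used for illustration:

```r
# Standardize an inconsistent spelling of the 'Status' value
orders_clean$Status[orders_clean$Status == "canceled"] <- "Canceled"

# Verify the correction
table(orders_clean$Status)
```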
```r
# Identify outliers in 'TotalAmount' using the IQR method
Q1 <- quantile(orders_clean$TotalAmount, 0.25)
Q3 <- quantile(orders_clean$TotalAmount, 0.75)
IQR <- Q3 - Q1

# Define lower and upper bounds
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR

# Filter out the outliers
orders_no_outliers <- subset(orders_clean, TotalAmount >= lower_bound & TotalAmount <= upper_bound)
```
Step 10: Recalculating Derived Columns
If your dataset has derived columns (like `TotalAmount`), recalculate them to ensure consistency.
```r
# Recalculate 'TotalAmount'
orders_clean$TotalAmount <- orders_clean$Quantity * orders_clean$Price
```
Step 11: Final Data Check
After cleaning, perform a final check to ensure the dataset is ready for analysis.
```r
# Check summary statistics again
summary(orders_clean)
# Validate that there are no more missing values
sum(is.na(orders_clean))
```
Step 12: Saving the Cleaned Data
Save the cleaned dataset for further analysis.
```r
# Save the cleaned dataset to a new file
write.csv(orders_clean, "orders_clean.csv", row.names=FALSE)
```
Summary of Steps:
1. Understand the Dataset: Know what data you have.
5. Correcting Data Types: Ensure correct data types for each column.
6. Removing Duplicates: Drop repeated rows.
7. Filtering Data: Keep only the rows relevant to the analysis.
8. Correcting Data Inconsistencies: Fix incorrect or misspelled entries.
9. Handling Outliers: Identify and filter extreme values.
10. Recalculating Derived Columns: Recompute derived values for consistency.
11. Final Data Check: Verify the data is clean and consistent.
12. Saving the Cleaned Data: Write the cleaned dataset to a new file.
Following these steps ensures your data is clean, accurate, and ready for analysis.
```r
# Install and load the readxl package
install.packages("readxl")
library(readxl)
```
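With `readxl` loaded, an Excel sheet can be read into a data frame using `read_excel()`; the file name below is assumed for illustration:

```r
# Read the first sheet of an Excel workbook into a data frame
marks_df <- read_excel("marks.xlsx")

# View the first few rows of the imported data
head(marks_df)
```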
```r
# Install and load the dplyr package
install.packages("dplyr")
library(dplyr)
```
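With `dplyr` loaded, common data manipulation verbs such as `filter()`, `select()`, and `mutate()` become available. A minimal sketch, assuming the column names from the marks dataset used in these notes:

```r
marks_df %>%
  filter(subject_1 > 80) %>%              # keep rows where subject_1 exceeds 80
  select(student_name, subject_1) %>%     # keep only two columns
  mutate(subject_1_pct = subject_1 / 100) # add a derived column
```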
```r
# Identify missing data
missing_data_summary <- colSums(is.na(marks_df))
print(missing_data_summary)
```
6. What is RStudio?
A) A programming language
B) An integrated development environment (IDE) for R
C) A data analysis tool only for Python
D) A web browser extension
Answer: B) An integrated development environment (IDE) for R
18. What is the function to import data from an Excel sheet using the `readxl` package?
A) `import_excel()`
B) `excel_read()`
C) `read_excel()`
D) `read_xlsx()`
Answer: C) `read_excel()`
19. If you have a CSV file named "data.csv", how would you read it into R?
A) `read.data("data.csv")`
B) `read.csv("data.csv")`
C) `csv.read("data.csv")`
D) `load.csv("data.csv")`
Answer: B) `read.csv("data.csv")`
20. Which function would you use to save a data frame into a CSV file in R?
A) `write.csv()`
B) `export.csv()`
C) `save.csv()`
D) `store.csv()`
Answer: A) `write.csv()`
21. What is the function to view the first few rows of a data frame in R?
A) `start()`
B) `head()`
C) `top()`
D) `view()`
Answer: B) `head()`
24. Which function is used to write a data frame into a CSV file?
A) `save.csv()`
B) `write.csv()`
C) `store.csv()`
D) `write.dataframe()`
Answer: B) `write.csv()`
25. Which of the following is NOT a way to view the contents of a data frame in R?
A) `print()`
B) `summary()`
C) `tail()`
D) `scan()`
Answer: D) `scan()`
6. What is RStudio?
Answer: RStudio is an Integrated Development Environment (IDE) for R.
8. What is CRAN?
Answer: CRAN is the Comprehensive R Archive Network, used for installing R packages.
18. How do you read an Excel file using the `readxl` package?
Answer: You use the `read_excel()` function.
21. Which function is used to view the first few rows of a data frame?
Answer: The `head()` function is used to view the first few rows of a data frame.
25. How do you view the last few rows of a data frame?
Answer: The `tail()` function is used to view the last few rows of a data frame.
28. What function would you use to print the entire data frame to the console?
Answer: The `print()` function is used to print the entire data frame.