Unit 03 Notes

This document provides an overview of data analysis and statistical data analysis, detailing key processes such as data collection, cleaning, transformation, exploratory analysis, and modeling. It emphasizes the utility of R programming in these analyses, highlighting its capabilities in statistical techniques, data manipulation, visualization, and reproducibility. Additionally, it includes basic R commands and examples of using the cut() function for categorizing data.

Uploaded by

ketangsuvagiya

UNIT-3 : NOTES BY : DR. SNEHAL K JOSHI

‘R – Programming’

Q. What is Data Analysis?


Answer :
Data Analysis is the process of systematically applying statistical and logical techniques to describe, summarize, and
evaluate data. It involves inspecting, cleaning, transforming, and modeling data with the goal of discovering useful
information, drawing conclusions, and supporting decision-making.

Key aspects of data analysis include:

1. Data Collection: Gathering relevant data from various sources.


2. Data Cleaning: Removing or correcting errors, handling missing values, and ensuring data consistency.
3. Data Transformation: Converting data into a suitable format for analysis, such as normalizing, scaling, or encoding
variables.
4. Exploratory Data Analysis (EDA): Visualizing and summarizing the main characteristics of the data to understand
patterns, trends, and relationships.
5. Modeling and Interpretation: Applying statistical models and algorithms to make predictions, infer relationships,
and draw conclusions.
6. Reporting: Communicating the results of the analysis through reports, visualizations, and dashboards to inform
decision-making.

Data analysis is used across various fields, including business, healthcare, social sciences, engineering, and more, to
make informed decisions, optimize processes, and gain insights into complex phenomena.


Q. What is Statistical Data Analysis?


Answer :
➢ Statistical Data Analysis is a specific subset of data analysis.
➢ Focuses on applying statistical techniques to understand, interpret, and draw conclusions from data.
➢ Involves using mathematical methods to analyze data sets, assess relationships, and test hypotheses.

Key components of statistical data analysis include:

1. Descriptive Statistics: Summarizing data through measures such as mean, median, mode, variance,
standard deviation, and correlation. This provides a basic understanding of the data's distribution and
central tendency.

2. Inferential Statistics: Making predictions or inferences about a population based on a sample of data.
Techniques include hypothesis testing, confidence intervals, regression analysis, and ANOVA (Analysis of
Variance).

3. Probability Theory: Assessing the likelihood of events occurring within a given data set. This includes the
use of probability distributions, such as the normal distribution, to model data.

4. Correlation and Causation: Analyzing the strength and direction of relationships between variables and
assessing whether one variable may cause changes in another. Correlation alone does not establish causation.

5. Statistical Modeling: Building models to explain the relationships between variables, such as linear
regression models for predicting outcomes.

6. Hypothesis Testing: Evaluating assumptions or claims about a population by analyzing sample data. This
involves tests such as t-tests and chi-square tests, whose p-values are used to judge the statistical significance of results.

Statistical data analysis is crucial for making evidence-based decisions and for understanding the underlying
patterns and relationships in data.
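
The components listed above can be tried out in a few lines of R. The sketch below is only an illustration: the two mark vectors are made-up sample data, and the paired t-test is just one possible choice of hypothesis test.

```r
# Hypothetical sample: marks of ten students in two subjects
maths   <- c(78, 85, 90, 95, 82, 88, 76, 91, 84, 87)
science <- c(88, 79, 92, 81, 94, 85, 87, 90, 77, 82)

# Descriptive statistics: central tendency and spread
maths_mean   <- mean(maths)
maths_median <- median(maths)
maths_sd     <- sd(maths)

# Correlation: strength and direction of the linear relationship
r <- cor(maths, science)

# Hypothesis testing: paired t-test of "mean difference is zero"
t_res <- t.test(maths, science, paired = TRUE)

print(c(mean = maths_mean, median = maths_median, sd = maths_sd))
print(r)
print(t_res$p.value)
```

A small p-value would suggest that the average marks in the two subjects differ; a large one means the sample gives no evidence of a difference.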

Q. How is R Useful in Data Analysis and Statistical Analysis of Data?


Answer :
➢ R is a powerful programming language and software environment specifically designed for statistical
computing and data analysis.
➢ It is widely used by statisticians, data analysts, and researchers for its robust capabilities in handling
and analyzing data. Here’s how R is useful in data analysis and statistical analysis:

1. Comprehensive Statistical Techniques:


R provides a wide array of built-in statistical functions and libraries for performing descriptive statistics,
hypothesis testing, regression analysis, and other advanced statistical modeling techniques.

R’s packages such as `stats`, `MASS`, and `lme4` offer tools for most common types of statistical
analysis, while `ggplot2` covers the accompanying graphics.

2. Data Manipulation and Cleaning:


R’s data manipulation capabilities, especially with packages like `dplyr`, `tidyr`, and `data.table`, allow
users to clean, filter, transform, and aggregate data efficiently.
Functions like `subset()`, `merge()`, and `apply()` are used for organizing and preprocessing data.

3. Data Visualization:
R excels in data visualization, with packages like `ggplot2`, `plotly`, and `lattice` providing tools to create
complex and publication-quality plots.
Visualizations like histograms, scatter plots, bar charts, box plots, and heatmaps help in exploring data
trends and relationships.

4. Reproducibility:
R scripts and notebooks (e.g., R Markdown) allow analysts to document their analysis process, making it
easier to reproduce and share results.
This is crucial in research and business environments where consistent and transparent analysis is
required.

5. Handling Large Datasets:


R keeps data in memory by default, but it can still work with large datasets efficiently thanks to packages like
`data.table` (fast in-memory processing) and `ff` (disk-backed storage for data that does not fit in memory).

6. Extensibility:
R has a vast ecosystem of packages (over 18,000 on CRAN) that extend its capabilities in data analysis.
Whether you need specialized statistical tests, machine learning models, or bioinformatics tools, R has a
package for it.
The community-driven nature of R ensures continuous updates and the availability of cutting-edge
techniques.

7. Integration with Other Tools:


R can easily integrate with other programming languages (like Python), databases (SQL), and data analysis
tools (Excel, Power BI).
This makes it a versatile tool that fits into various stages of the data analysis workflow.

8. Open Source and Free:


R is open-source software, meaning it’s free to use, modify, and distribute. This has led to its widespread
adoption in academia, research, and industry.
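
The base functions named under point 2 (`subset()`, `merge()`, and `apply()`) can be seen working together in the short sketch below. The two small data frames are hypothetical and exist only to make the example self-contained.

```r
# Hypothetical data: student details and marks kept in separate tables
students <- data.frame(id = c("S01", "S02", "S03"),
                       name = c("Rutvik", "Pankti", "Evaan"))
marks <- data.frame(id = c("S01", "S02", "S03"),
                    subject_1 = c(90, 85, 92),
                    subject_2 = c(70, 88, 95))

# merge(): join the two tables on the shared 'id' column
combined <- merge(students, marks, by = "id")

# apply(): sum the mark columns row by row (MARGIN = 1 means rows)
combined$total <- apply(combined[, c("subject_1", "subject_2")], 1, sum)

# subset(): keep only students whose total is above 170
toppers <- subset(combined, total > 170)
print(toppers)
```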


Example:

Consider analyzing the `marks.csv` dataset using R. You could start by importing the data, cleaning it (e.g.,
handling missing values), performing descriptive statistics (e.g., finding the mean and standard deviation of
students' marks), and visualizing the distribution of marks using histograms. Then, you could build a linear
regression model to predict student performance based on their marks in different subjects. All of this can
be done efficiently using R’s extensive statistical and data manipulation capabilities.
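
A minimal sketch of that workflow is given below. It assumes a small in-memory data frame in place of `marks.csv` (the values and the missing entry are invented), and it uses `hist(..., plot = FALSE)` so that it also runs where no graphics device is available.

```r
# Hypothetical stand-in for marks.csv so the sketch runs on its own
marks <- data.frame(
  subject_1 = c(78, 85, 90, NA, 82, 88, 76, 91),
  subject_2 = c(88, 79, 92, 81, 94, 85, 87, 90),
  subject_3 = c(90, 88, 85, 92, 81, 87, 89, 84)
)

# Cleaning: fill the missing subject_1 mark with the column median
marks$subject_1[is.na(marks$subject_1)] <- median(marks$subject_1, na.rm = TRUE)

# Descriptive statistics
mean_s1 <- mean(marks$subject_1)
sd_s1   <- sd(marks$subject_1)

# Distribution: bin counts of a histogram (drop plot = FALSE to draw it)
h <- hist(marks$subject_1, plot = FALSE)

# Linear regression: predict subject_3 from the other two subjects
model <- lm(subject_3 ~ subject_1 + subject_2, data = marks)
print(coef(model))
```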

In summary, R is a versatile and powerful tool for both data analysis and statistical analysis, offering a
comprehensive suite of tools for data manipulation, statistical modeling, and data visualization.

Q. Explain Basic commands of R.


Answer:
R is a powerful programming language used for statistical computing and data analysis. Here's an overview
of the basic R syntax to help you get started:
1. Comments

Use `#` to add comments in your code. Anything after `#` on a line will be ignored by R.
```r
# This is a comment
```
2. Variables and Assignment

Assign values to variables using the `<-` operator or the `=` operator.
```r
x <- 10 # Assign 10 to x
y = 5 # Assign 5 to y
```
3. Basic Data Types
Numeric: Represents numbers (e.g., integers, floating-point numbers).
```r
num <- 42 # Numeric
```
Character: Represents text strings.
```r
str <- "Hello" # Character
```

Logical: Represents boolean values `TRUE` or `FALSE`.


```r
flag <- TRUE # Logical
```
4. Vectors

Vectors are a basic data structure in R, which hold elements of the same type.
Create a vector using the `c()` function.
```r
vec <- c(1, 2, 3, 4, 5) # Numeric vector
names <- c("Alice", "Bob", "Charlie") # Character vector
```
5. Operations on Vectors

Perform arithmetic operations element-wise.

```r
vec2 <- vec * 2 # Multiplies each element by 2
sum_vec <- vec + vec2 # Adds corresponding elements
```
6. Sequences and Repetitions
Create sequences using `:` or the `seq()` function.
```r
seq1 <- 1:10 # Sequence from 1 to 10
seq2 <- seq(1, 10, by = 2) # Sequence from 1 to 10 with step 2: 1, 3, 5, 7, 9
```
Repeat elements using the `rep()` function.
```r
rep_vec <- rep(1:3, times = 3) # Repeats 1, 2, 3 three times
```
7. Matrices
Matrices are two-dimensional arrays that contain elements of the same type.
```r
mat <- matrix(1:9, nrow = 3, ncol = 3) # 3x3 matrix, filled column by column
```
8. Lists

Lists can contain elements of different types.


```r
lst <- list(name = "Rutvik", age = 25, scores = c(90, 85, 92))
```


9. Data Frames
Data frames are used to store tabular data, with rows representing observations and columns representing
variables.
df <- data.frame(
  name = c("Rutvik", "Pankti", "Evaan"),
  age = c(24, 19, 8),
  score = c(90, 85, 92)
)
10. Accessing Elements
Use brackets `[]` to access elements in vectors, matrices, and data frames.
```r
vec[1] # First element of vector
mat[1,2] # Element in the first row, second column of matrix
df$name # Access the 'name' column of the data frame
df[1, "age"] # Access the 'age' value of the first row
```
11. Conditional Statements

Use `if`, `else if`, and `else` to control the flow of your program.
```r
x <- 10
if (x > 0) {
  print("Positive")
} else if (x == 0) {
  print("Zero")
} else {
  print("Negative")
}
```
12. Loops
Use `for` and `while` loops for repetitive tasks.
# for loop
for (i in 1:5) {
  print(i)
}
# while loop
count <- 1
while (count <= 5) {
  print(count)
  count <- count + 1 # Increment the counter so the loop ends
}


13. Functions
Define functions using the `function` keyword.
```r
add <- function(a, b) {
  return(a + b)
}
result <- add(3, 4) # Calls the function with arguments 3 and 4
```
14. Packages

Install and load packages using `install.packages()` and `library()`.


```r
install.packages("ggplot2") # Install ggplot2 package
library(ggplot2) # Load ggplot2 package
```
15. Basic Plotting
R has built-in plotting capabilities.
```r
plot(vec) # Plot the vector
hist(vec) # Create a histogram
```
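
Several of the basics above (vectors, a function, conditionals, a loop, and a data frame) can be combined in one short script. The sketch below is only an illustration: the names, marks, and grade cut-offs are invented for the example.

```r
# Hypothetical marks stored in a named numeric vector
marks <- c(Asha = 91, Bhavin = 67, Charu = 45, Dev = 78, Esha = 88)

# A function that uses if / else if / else to pick a grade
grade_of <- function(mark) {
  if (mark >= 80) {
    "A"
  } else if (mark >= 60) {
    "B"
  } else {
    "C"
  }
}

# A for loop that fills a character vector with one grade per student
grades <- character(length(marks))
for (i in seq_along(marks)) {
  grades[i] <- grade_of(marks[i])
}

# Collect everything in a data frame and display it
result <- data.frame(name = names(marks), mark = unname(marks), grade = grades)
print(result)
```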


Q. How to use cut() function? Explain with example.


Answer:
The `cut` function in R is used to divide continuous data into intervals or "bins" and is particularly useful
for converting numeric data into categorical data.
It creates factors from numeric data by segmenting the range of data into intervals.
Syntax
```r
cut(x, breaks, labels = NULL, include.lowest = FALSE, right = TRUE, dig.lab = 3, ordered_result = FALSE)
```
Arguments

`x`: A numeric vector that you want to segment into intervals.


`breaks`: Specifies how to cut the data. It can be:

A numeric vector of two or more unique cut points.

A single number (number of intervals or bins).


`labels`: Optional. The labels for the resulting intervals. If not provided, the intervals will be labeled with the
range of values.
`include.lowest`: Logical, indicating if the lowest (or highest, for right = FALSE) value should be included in
the first (or last) interval.
`right`: Logical, indicating if the intervals should be closed on the right (and open on the left) or closed on
the left (and open on the right).
`dig.lab`: An integer which is used when labels are not given, to determine the number of digits used in
interval labels.
`ordered_result`: Logical, if `TRUE`, the result is an ordered factor.

Examples
1. Basic Use

Suppose we have a vector of numeric values representing ages and want to categorize them into age groups:


```r
ages <- c(23, 45, 18, 67, 34, 50, 29, 40)
# Divide ages into 3 bins
age_groups <- cut(ages, breaks = 3)
print(age_groups)
```


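Output:
```
[1] (18,34.3] (34.3,50.7] (18,34.3] (50.7,67] (18,34.3] (34.3,50.7] (18,34.3] (34.3,50.7]
Levels: (18,34.3] (34.3,50.7] (50.7,67]
```
Here, `cut` divides the range of the data (18 to 67) into 3 intervals of equal width and assigns each age to one of them. The outer limits are moved outward by 0.1% of the range, so the minimum (18) still falls inside the first interval even though its label reads `(18,34.3]`.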
2. Specifying Breakpoints

We can also specify the exact breakpoints:


```r
# Specify custom breakpoints
age_groups <- cut(ages, breaks = c(0, 18, 35, 50, 100))
print(age_groups)
```
Output:
```
[1] (18,35] (35,50] (0,18] (50,100] (18,35] (35,50] (18,35] (35,50]
Levels: (0,18] (18,35] (35,50] (50,100]
```
This divides the ages into the intervals (0,18], (18,35], (35,50], and (50,100]. For whole-number ages that means 1-18, 19-35, 36-50, and 51-100.
3. Adding Labels

You can label the intervals for clarity:


```r
ages <- c(23, 45, 18, 67, 34, 50, 29, 40, 18, 90)
age_groups <- cut(ages, breaks = c(0, 18, 35, 50, 100),
                  labels = c("Child", "Young", "Adult", "Senior"))
print(age_groups)
```
Output:
```
[1] Young Adult Child Senior Young Adult Young Adult Child Senior
Levels: Child Young Adult Senior
```

4. Including the Lowest Value

If you want to include the lowest value in the first interval:


```r
age_groups <- cut(ages, breaks = c(18, 35, 50, 100), include.lowest = TRUE)
print(age_groups)
```


5. Closed Intervals
By default, the intervals are closed on the right, meaning that the interval `(a,b]` includes `b` but not `a`. To
close the interval on the left instead, set `right = FALSE`:
```r
age_groups <- cut(ages, breaks = c(18, 35, 50, 100), right = FALSE)
print(age_groups)
```
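
Examples 4 and 5 show no output, so the sketch below compares the three variants side by side on the ten ages from example 3. The variable names (`brks`, `default_cut`, and so on) are chosen only for this illustration.

```r
ages <- c(23, 45, 18, 67, 34, 50, 29, 40, 18, 90)
brks <- c(18, 35, 50, 100)

# Default: intervals are (18,35], (35,50], (50,100], so an age of 18 fits nowhere
default_cut <- cut(ages, breaks = brks)

# include.lowest = TRUE closes the first interval on the left: [18,35]
with_lowest <- cut(ages, breaks = brks, include.lowest = TRUE)

# right = FALSE closes every interval on the left: [18,35), [35,50), [50,100)
left_closed <- cut(ages, breaks = brks, right = FALSE)

print(sum(is.na(default_cut)))   # how many ages became NA under the default
print(levels(with_lowest))
print(levels(left_closed))
print(table(left_closed))
```

Note how the age 50 moves from the middle interval to the last one when `right = FALSE` is used, and how the two 18-year-olds are lost (turned into `NA`) unless the lowest break is included.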

Use Case
The `cut` function is often used in data preprocessing to transform continuous variables into categorical
variables.
For example, if you're working with age data and want to categorize it into age groups (e.g., Child, Teen,
Adult), `cut` is a convenient way to achieve this.

Q. Create a CSV containing around 20 records using R/Python code.


Answer :
Python-Code :

import pandas as pd

# Creating a DataFrame with twenty student records
data = {
    "student_id": [f"S{i:02}" for i in range(1, 21)],
    "student_name": [
        "John Doe", "Jane Smith", "Michael Brown", "Emily Davis", "Chris Johnson",
        "Sarah Lee", "Matthew Taylor", "Jessica Wilson", "David White", "Laura Moore",
        "James Harris", "Sophia Thompson", "Daniel Martin", "Olivia Clark", "Henry Lewis",
        "Emma Walker", "Liam Hall", "Ava Young", "Mason Allen", "Isabella King"
    ],
    "subject_1": [78, 85, 90, 95, 82, 88, 76, 91, 84, 87, 89, 93, 81, 77, 92, 86, 79, 80, 83, 94],
    "subject_2": [88, 79, 92, 81, 94, 85, 87, 90, 77, 82, 91, 84, 80, 89, 83, 86, 93, 78, 92, 88],
    "subject_3": [90, 88, 85, 92, 81, 87, 89, 84, 93, 91, 86, 79, 88, 80, 77, 82, 94, 92, 79, 83],
}

df = pd.DataFrame(data)

# Calculate total, percentage, and class
df["total"] = df[["subject_1", "subject_2", "subject_3"]].sum(axis=1)
df["percentage"] = df["total"] / 3
df["class"] = pd.cut(df["percentage"], bins=[0, 50, 60, 70, 80, 90, 100], labels=["F", "D", "C", "B", "A", "A+"])

# Save the DataFrame to a CSV file (adjust the path to a folder that exists on your machine)
file_path = "/mnt/data/student_records.csv"
df.to_csv(file_path, index=False)
print(file_path)

R-Code :

# Create data
student_id <- sprintf("S%02d", 1:20)
student_name <- c("Rakesh", "Ramesh", "Mahesh", "Ritesh", "Rahul",
                  "Sarah", "Pankti", "Manish", "Yatin", "Nirav", "Darshna", "Rachna", "Premal",
                  "Rutvik", "Evaan", "Divyesh", "Bhargavi", "Jignesh",
                  "Mitesh", "Payal")
subject_1 <- c(78, 85, 90, 95, 82, 88, 76, 91, 84, 87, 89, 93, 81, 77, 92, 86, 79, 80, 83, 94)
subject_2 <- c(88, 79, 92, 81, 94, 85, 87, 90, 77, 82, 91, 84, 80, 89, 83, 86, 93, 78, 92, 88)
subject_3 <- c(90, 88, 85, 92, 81, 87, 89, 84, 93, 91, 86, 79, 88, 80, 77, 82, 94, 92, 79, 83)

# Calculate total, percentage, and class
total <- subject_1 + subject_2 + subject_3
percentage <- total / 3
class <- cut(percentage, breaks = c(0, 50, 60, 70, 80, 90, 100),
             labels = c("F", "D", "C", "B", "A", "A+"))

# Create a data frame
students_df <- data.frame(student_id, student_name, subject_1, subject_2, subject_3, total, percentage, class)

# Write to CSV (saved in the current working directory)
write.csv(students_df, file = "marks.csv", row.names = FALSE)

# List all files and folders in the /content directory to confirm the file was written
# (assumes /content is the working directory)
list.files("/content")


Q. Write R code to read records from a CSV file that exists in a folder on the
desktop and display the records one by one.
Answer :
# Define the file path (adjust it to the folder that holds the file)
file_path <- "/content/marks.csv"

# Read the CSV file into a data frame
students_df <- read.csv(file_path)

# Display the records one by one
for (i in seq_len(nrow(students_df))) {
  print(students_df[i, ])
  # readline() waits for the user only in an interactive session
  readline(prompt = "Press [Enter] to see the next record")
}

Q. Write R code to add a record to an existing CSV file in a folder on the
desktop and display the total number of records in the file after the new
record is added.
Answer :
# Define the file path (adjust it to the folder that holds the file)
file_path <- "/content/marks.csv"

# Read the existing CSV file into a data frame
students_df <- read.csv(file_path)

# Create a new record with the same columns as the existing data
new_record <- data.frame(
  student_id = "S21",
  student_name = "Lily Scott",
  subject_1 = 85,
  subject_2 = 90,
  subject_3 = 88,
  total = 85 + 90 + 88,
  percentage = (85 + 90 + 88) / 3,
  class = cut((85 + 90 + 88) / 3, breaks = c(0, 50, 60, 70, 80, 90, 100), labels = c("F", "D", "C", "B", "A", "A+"))
)

# Append the new record to the existing data frame
students_df <- rbind(students_df, new_record)

# Write the updated data frame back to the CSV file
write.csv(students_df, file = file_path, row.names = FALSE)

# Display the total number of records
cat("Total number of records in the CSV file after adding the new record:", nrow(students_df), "\n")
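
The answer above depends on a file already being present at the given path. A self-contained way to try the same read, append, and write-back cycle is sketched below. It assumes a temporary file created with `tempfile()` and a shortened, made-up set of columns.

```r
# Hypothetical starting file with two records, written to a temporary location
csv_path <- tempfile(fileext = ".csv")
students <- data.frame(student_id = c("S01", "S02"),
                       student_name = c("Rakesh", "Ramesh"),
                       total = c(256, 252))
write.csv(students, file = csv_path, row.names = FALSE)

# Read the existing records, append one more, and write everything back
existing <- read.csv(csv_path)
new_record <- data.frame(student_id = "S03", student_name = "Mahesh", total = 267)
updated <- rbind(existing, new_record)
write.csv(updated, file = csv_path, row.names = FALSE)

# Re-read the file to count the records actually stored on disk
n_records <- nrow(read.csv(csv_path))
cat("Total records:", n_records, "\n")
```

`rbind()` requires the new record to have the same column names as the existing data frame, which is why `new_record` repeats all three columns.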

Q. Explain Data Filtering and Data Cleaning giving appropriate example.


Answer:
Data filtering and cleaning are crucial steps in the data pre-processing pipeline.

These steps help ensure that the dataset you're working with is accurate, consistent, and relevant, which ultimately
leads to better analysis and more reliable results.

Let's go through each step in detail with an example.

Step 1: Understand the Dataset


Before starting with data filtering and cleaning, it's important to understand the structure, content, and context of
the dataset.

Example:

Suppose we have a dataset of customer orders from an e-commerce platform. The dataset includes the following
columns:

`OrderID`: Unique identifier for each order

`CustomerID`: Unique identifier for each customer

`OrderDate`: Date of the order

`ProductID`: Unique identifier for each product

`Quantity`: Number of units ordered

`Price`: Price per unit

`TotalAmount`: Total amount for the order (Quantity * Price)

`Status`: Status of the order (e.g., Completed, Pending, Canceled)

Step 2: Load the Dataset


Load the dataset into your environment using the appropriate method.

# Load dataset in R
orders <- read.csv("orders.csv")


Step 3: Initial Data Exploration


Explore the dataset to get an overview of its structure, identify any obvious issues, and understand the range and
distribution of the data.

```r
# View the first few rows of the dataset
head(orders)
# Check the structure of the dataset
str(orders)
# Summary statistics for numerical columns
summary(orders)
```

Step 4: Handling Missing Data


Missing data can lead to biased or incorrect analysis. You need to identify missing values and decide how to handle
them.

```r
# Check for missing values
sum(is.na(orders))

# View which columns have missing data
colSums(is.na(orders))
```
Handling Missing Data:

Remove Rows/Columns: If missing values are few and scattered, you can remove the affected rows or columns.

Impute Missing Values: Replace missing values with the mean, median, mode, or other relevant values.

```r
# Option 1: remove rows with missing values
orders_clean <- na.omit(orders)
# Option 2: impute missing values in 'Quantity' with the median
orders$Quantity[is.na(orders$Quantity)] <- median(orders$Quantity, na.rm = TRUE)
```
Step 5: Correcting Data Types
Ensure that each column has the correct data type (e.g., numeric, factor, date).

```r
# Convert 'OrderDate' to Date type
orders$OrderDate <- as.Date(orders$OrderDate, format = "%Y-%m-%d")

# Convert 'Status' to a factor
orders$Status <- as.factor(orders$Status)
```


Step 6: Removing Duplicates


Check for and remove any duplicate records to avoid skewing your analysis.

```r
# Check for duplicates based on all columns
sum(duplicated(orders))
# Remove duplicate rows (starting from `orders`, which holds the imputed values and converted types)
orders_clean <- orders[!duplicated(orders), ]
```
Step 7: Filtering Data
Filtering involves selecting a subset of data that is relevant for your analysis.

Example 1:

Filter out canceled orders since they may not be relevant to the analysis.

```r
# Filter to include only completed and pending orders
orders_clean <- subset(orders_clean, Status != "Canceled")
```
Example 2:

Filter data within a specific date range.

```r
# Filter orders from 2023 only
orders_2023 <- subset(orders_clean, OrderDate >= as.Date("2023-01-01") & OrderDate <= as.Date("2023-12-31"))
```
Step 8: Correcting Data Inconsistencies
Look for inconsistencies, such as incorrect or misspelled entries, and correct them.

```r
# Check for inconsistent entries in 'Status'
table(orders_clean$Status)

# Standardize the status entries (recode() comes from the dplyr package)
library(dplyr)
orders_clean$Status <- recode(orders_clean$Status,
                              'Complete' = 'Completed',
                              'Pending' = 'Pending',
                              'Cancelled' = 'Canceled')
```


Step 9: Handling Outliers


Outliers are data points that significantly differ from other observations. They can distort analysis results if not
handled properly.

```r
# Identify outliers in 'TotalAmount' using IQR method
Q1 <- quantile(orders_clean$TotalAmount, 0.25)
Q3 <- quantile(orders_clean$TotalAmount, 0.75)
IQR <- Q3 - Q1
# Define lower and upper bounds
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
# Filter out the outliers
orders_no_outliers <- subset(orders_clean, TotalAmount >= lower_bound & TotalAmount <= upper_bound)
```
Step 10: Recalculating Derived Columns
If your dataset has derived columns (like `TotalAmount`), recalculate them to ensure consistency.

```r
# Recalculate 'TotalAmount'
orders_clean$TotalAmount <- orders_clean$Quantity * orders_clean$Price
```
Step 11: Final Data Check
After cleaning, perform a final check to ensure the dataset is ready for analysis.

```r
# Check summary statistics again
summary(orders_clean)
# Validate that there are no more missing values
sum(is.na(orders_clean))
```
Step 12: Saving the Cleaned Data
Save the cleaned dataset for further analysis.

```r
# Save the cleaned dataset to a new file
write.csv(orders_clean, "orders_clean.csv", row.names=FALSE)
```
Summary of Steps:
1. Understand the Dataset: Know what data you have.

2. Load the Dataset: Import the data.

3. Initial Data Exploration: Explore the structure and summary statistics.

4. Handling Missing Data: Identify and impute or remove missing values.

5. Correcting Data Types: Ensure correct data types for each column.

6. Removing Duplicates: Remove any duplicate records.

7. Filtering Data: Select relevant subsets of the data.

8. Correcting Data Inconsistencies: Standardize inconsistent data.

9. Handling Outliers: Identify and treat outliers.

10. Recalculating Derived Columns: Ensure derived columns are accurate.

11. Final Data Check: Verify the data is clean and consistent.

12. Saving the Cleaned Data: Save the clean dataset.

Following these steps ensures your data is clean, accurate, and ready for analysis.
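
Because `orders.csv` is not supplied with these notes, the steps above can be tried on a tiny in-memory dataset instead. The sketch below is an illustration under stated assumptions: the eight rows are invented (with a duplicate, a missing quantity, misspelled statuses, and one extreme value planted on purpose), base R is used in place of `dplyr::recode()`, and the derived column is computed before the outlier step so that the outlier check has something to work on.

```r
# Hypothetical orders with deliberate problems
orders <- data.frame(
  OrderID   = c(1, 2, 2, 3, 4, 5, 6, 7),
  OrderDate = c("2023-01-05", "2023-02-10", "2023-02-10", "2023-03-15",
                "2022-12-30", "2023-04-01", "2023-05-20", "2023-06-11"),
  Quantity  = c(2, 1, 1, NA, 3, 2, 1, 200),
  Price     = c(10, 25, 25, 8, 12, 15, 30, 10),
  Status    = c("Completed", "Pending", "Pending", "Complete",
                "Completed", "Cancelled", "Completed", "Completed"),
  stringsAsFactors = FALSE
)

# Missing data: impute Quantity with the median
orders$Quantity[is.na(orders$Quantity)] <- median(orders$Quantity, na.rm = TRUE)

# Data types: convert the date column
orders$OrderDate <- as.Date(orders$OrderDate, format = "%Y-%m-%d")

# Duplicates: drop repeated rows
orders_clean <- orders[!duplicated(orders), ]

# Inconsistencies: standardize the spelling of Status
orders_clean$Status[orders_clean$Status == "Complete"]  <- "Completed"
orders_clean$Status[orders_clean$Status == "Cancelled"] <- "Canceled"

# Filtering: drop canceled orders and keep 2023 only
orders_clean <- subset(orders_clean, Status != "Canceled")
orders_clean <- subset(orders_clean,
                       OrderDate >= as.Date("2023-01-01") &
                       OrderDate <= as.Date("2023-12-31"))

# Derived column: total amount per order
orders_clean$TotalAmount <- orders_clean$Quantity * orders_clean$Price

# Outliers: keep rows within 1.5 * IQR of the quartiles
q1 <- quantile(orders_clean$TotalAmount, 0.25)
q3 <- quantile(orders_clean$TotalAmount, 0.75)
iqr <- q3 - q1
orders_no_outliers <- subset(orders_clean,
                             TotalAmount >= q1 - 1.5 * iqr &
                             TotalAmount <= q3 + 1.5 * iqr)

# Final check
print(orders_no_outliers)
print(sum(is.na(orders_no_outliers)))
```

Standardizing `Status` before filtering matters here: if the filter ran first, the row spelled "Cancelled" would slip through.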


Q. Using “marks.csv”, Explain following topics.


(i) How to import data into dataframe from csv and excel file formats in R?
(ii) How to add, remove and rename variables and attributes in R?
(iii) How to subset and filter data in R ?
(iv) What is purpose of cleaning and transforming data in R ? How to clean and
transform data in R ? Give appropriate example and explain it.
(v) What is missing data ? How to identify and handle missing data using R ?
(vi) How to perform data type conversion and recoding ?
Answers:
(i) How to Import Data into DataFrame from CSV and Excel File Formats in R?
Importing Data from CSV:
```r
# Importing data from a CSV file
file_path <- "/content/marks.csv"
marks_df <- read.csv(file_path)
# Display the first few rows of the data frame
head(marks_df)
```
Importing Data from Excel:
Install the `readxl` package to read Excel files.

```r
# Install (once) and load the readxl package
install.packages("readxl")
library(readxl)

# Importing data from an Excel file
excel_file_path <- "/content/marks.xlsx"
marks_df_excel <- read_excel(excel_file_path)

# Display the first few rows of the data frame
head(marks_df_excel)
```


(ii) How to Add, Remove, and Rename Variables and Attributes in R?


Adding a New Variable:
```r
# Adding a new variable 'total_marks' by summing up marks of all subjects
marks_df$total_marks <- marks_df$subject_1 + marks_df$subject_2 + marks_df$subject_3
head(marks_df)
```
Removing a Variable:
```r
# Removing the 'subject_1' column
marks_df$subject_1 <- NULL
head(marks_df)
```
Renaming a Variable:
Use the `dplyr` package to rename columns.

```r
# Install (once) and load the dplyr package
install.packages("dplyr")
library(dplyr)

# Renaming the 'subject_2' column to 'Mathematics'
marks_df <- rename(marks_df, Mathematics = subject_2)
head(marks_df)
```
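
The question also mentions attributes, and renaming can be done without any extra package. The sketch below shows one possible base R approach using `names()` and `attr()`. The two-row data frame and the attribute name `source` are hypothetical.

```r
# Hypothetical miniature version of the marks data
marks_df <- data.frame(student_id = c("S01", "S02"),
                       subject_1 = c(78, 85),
                       subject_2 = c(88, 79))

# Rename a column with base R by assigning into names()
names(marks_df)[names(marks_df) == "subject_2"] <- "Mathematics"

# Add a custom attribute that records where the data came from
attr(marks_df, "source") <- "marks.csv"
source_before <- attr(marks_df, "source")

# Remove the attribute again by assigning NULL
attr(marks_df, "source") <- NULL

print(names(marks_df))
print(source_before)
```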
(iii) How to Subset and Filter Data in R?
Subsetting Data:
```r
# Subsetting data for students with total_marks greater than 250
subset_df <- subset(marks_df, total_marks > 250)
print(subset_df)
```
Filtering Data:
```r
# Filtering students who scored more than 90 in subject_3
high_scorers <- marks_df[marks_df$subject_3 > 90, ]
print(high_scorers)
```
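
Conditions can also be combined, and `subset()` can pick columns at the same time through its `select` argument. The sketch below uses the first five records of the marks data from the earlier answer, with the threshold values chosen only for illustration.

```r
# First five records of the marks data, rebuilt in memory
marks_df <- data.frame(
  student_name = c("Rakesh", "Ramesh", "Mahesh", "Ritesh", "Rahul"),
  subject_1 = c(78, 85, 90, 95, 82),
  subject_2 = c(88, 79, 92, 81, 94),
  subject_3 = c(90, 88, 85, 92, 81)
)
marks_df$total_marks <- marks_df$subject_1 + marks_df$subject_2 + marks_df$subject_3

# Rows: total above 255 AND subject_3 above 85; columns: name and total only
selected <- subset(marks_df,
                   total_marks > 255 & subject_3 > 85,
                   select = c(student_name, total_marks))
print(selected)
```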
(iv) What is the Purpose of Cleaning and Transforming Data in R? How to Clean and
Transform Data in R?


Purpose of Cleaning and Transforming Data:


Cleaning: Removing inconsistencies, fixing missing values, and ensuring data quality.

Transforming: Converting data into a suitable format or scale for analysis.

Cleaning and Transforming Data Example:


```r
# Cleaning: Replacing any missing values in 'subject_3' with the mean of the column
marks_df$subject_3[is.na(marks_df$subject_3)] <- mean(marks_df$subject_3, na.rm = TRUE)
# Transforming: Creating a new variable 'average_marks' from existing variables
marks_df$average_marks <- marks_df$total_marks / 3
head(marks_df)
```
(v) What is Missing Data? How to Identify and Handle Missing Data Using R?
Missing Data:
Missing data are values that are absent from a dataset (shown as `NA` in R). They can bias or break an analysis if they are ignored.

Identifying and Handling Missing Data:

```r
# Identifying missing data
missing_data_summary <- colSums(is.na(marks_df))
print(missing_data_summary)

# Handling missing data: Removing rows with any missing data
cleaned_df <- na.omit(marks_df)
head(cleaned_df)
```
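
Removing rows is not the only option. The sketch below, built on a made-up four-row data frame, contrasts counting complete rows with `complete.cases()` and filling the gaps with each column's mean.

```r
# Hypothetical marks with one missing value in each column
marks_df <- data.frame(subject_1 = c(78, NA, 90, 95),
                       subject_3 = c(90, 88, NA, 92))

# Identify: missing values per column and rows that are fully observed
missing_per_column <- colSums(is.na(marks_df))
complete_rows <- complete.cases(marks_df)

# Handle (option A): keep only complete rows
dropped <- marks_df[complete_rows, ]

# Handle (option B): replace each NA with the mean of its column
imputed <- marks_df
for (col in names(imputed)) {
  imputed[[col]][is.na(imputed[[col]])] <- mean(imputed[[col]], na.rm = TRUE)
}

print(missing_per_column)
print(dropped)
print(imputed)
```

Dropping rows loses half of this small dataset, while imputation keeps every row at the cost of inventing plausible values, so the choice depends on how much data can be spared.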
(vi) How to Perform Data Type Conversion and Recoding?

Data Type Conversion:


```r
# Converting student_id from character to factor
marks_df$student_id <- as.factor(marks_df$student_id)
str(marks_df)
```
Recoding Data:
Use the base R `cut()` function to recode a numeric variable into categories.

# Recoding 'subject_3' marks into categories
marks_df$subject_3_category <- cut(marks_df$subject_3,
                                   breaks = c(0, 70, 85, 100),
                                   labels = c("Low", "Medium", "High"))
head(marks_df)
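
Conversion and recoding often go together when marks arrive as text. The sketch below assumes a hypothetical character vector containing one entry that is not a number, converts it with `as.numeric()`, and then recodes the result with the same breaks as above.

```r
# Hypothetical marks read in as text; "ninety" cannot be converted
raw_marks <- c("78", "85", "ninety", "92")

# Type conversion: character to numeric (unconvertible entries become NA)
marks_num <- suppressWarnings(as.numeric(raw_marks))

# Recoding: numeric marks to ordered categories
category <- cut(marks_num,
                breaks = c(0, 70, 85, 100),
                labels = c("Low", "Medium", "High"))

print(class(raw_marks))
print(class(marks_num))
print(category)
```

A mark of exactly 85 lands in "Medium" because intervals are closed on the right by default.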


MCQ and Short Q/A for UNIT-3


1. What is R primarily used for?
A) Web development
B) Game design
C) Data analysis and statistics
D) Mobile app development
Answer: C) Data analysis and statistics

2. Which of the following is a key feature of R?


A) Supports only structured data
B) Complex graphical capabilities for data visualization
C) Only works with small datasets
D) Limited community support
Answer: B) Complex graphical capabilities for data visualization

3. R is an example of which type of programming language?


A) Compiled
B) Low-level
C) Interpreted
D) Machine-level
Answer: C) Interpreted

4. Which of the following industries frequently uses R for analysis?


A) Manufacturing
B) Healthcare
C) Agriculture
D) Real estate
Answer: B) Healthcare

5. R is especially known for its use in:


A) Network security
B) Natural language processing
C) Statistical computing
D) Image processing
Answer: C) Statistical computing

6. What is RStudio?
A) A programming language
B) An integrated development environment (IDE) for R
C) A data analysis tool only for Python
D) A web browser extension
Answer: B) An integrated development environment (IDE) for R

7. Which website is commonly used to download R?


A) www.r-project.org
B) www.github.com
C) www.python.org
D) www.stackoverflow.com
Answer: A) www.r-project.org


8. Which repository does R use to install additional packages?


A) pip
B) npm
C) CRAN
D) RubyGems
Answer: C) CRAN

9. Which operating system supports R?


A) Windows
B) MacOS
C) Linux
D) All of the above
Answer: D) All of the above

10. After installing R, which command is used to check if R is working in RStudio?


A) `print("Hello, World!")`
B) `echo "Hello"`
C) `RTest()`
D) `system("Hello")`
Answer: A) `print("Hello, World!")`

11. Which symbol is conventionally used to assign values to variables in R?


A) `=`
B) `<-`
C) `==`
D) `=>`
Answer: B) `<-`

12. Which of the following is NOT a basic data type in R?


A) Numeric
B) Character
C) Array
D) Logical
Answer: C) Array

13. What will the following R code return? `x <- 10; class(x)`


A) "character"
B) "logical"
C) "numeric"
D) "integer"
Answer: C) "numeric"

14. Which of the following is used for comments in R?


A) `//`
B) `#`
C) `/* */`
D) `<!--->`
Answer: B) `#`


15. How do you create a vector of numbers from 1 to 5 in R?


A) `1 : 5`
B) `c(1,5)`
C) `seq(1,5)`
D) `range(1:5)`
Answer: A) `1 : 5` (option C, `seq(1,5)`, produces the same vector, but `1 : 5` is the most direct form)

16. Which function is used to import a CSV file into R?


A) `read.csv()`
B) `import.csv()`
C) `load.csv()`
D) `fetch.csv()`
Answer: A) `read.csv()`

17. Which package is needed to import Excel files into R?


A) ggplot2
B) readxl
C) dplyr
D) xlsxReader
Answer: B) readxl

18. What is the function to import data from an Excel sheet using the `readxl` package?
A) `import_excel()`
B) `excel_read()`
C) `read_excel()`
D) `read_xlsx()`
Answer: C) `read_excel()` (`read_xlsx()` also exists in `readxl`, but it reads only `.xlsx` files)

19. If you have a CSV file named "data.csv", how would you read it into R?
A) `read.data("data.csv")`
B) `read.csv("data.csv")`
C) `csv.read("data.csv")`
D) `load.csv("data.csv")`
Answer: B) `read.csv("data.csv")`

20. Which function would you use to save a data frame into a CSV file in R?
A) `write.csv()`
B) `export.csv()`
C) `save.csv()`
D) `store.csv()`
Answer: A) `write.csv()`

21. What is the function to view the first few rows of a data frame in R?
A) `start()`
B) `head()`
C) `top()`
D) `view()`
Answer: B) `head()`


22. Which function is used to create a data frame in R?


A) `make.dataframe()`
B) `df.create()`
C) `data.frame()`
D) `build.df()`
Answer: C) `data.frame()`

23. How can you display the structure of a data frame in R?


A) `info()`
B) `describe()`
C) `str()`
D) `summary()`
Answer: C) `str()`

24. Which function is used to write a data frame into a CSV file?
A) `save.csv()`
B) `write.csv()`
C) `store.csv()`
D) `write.dataframe()`
Answer: B) `write.csv()`

25. Which of the following is NOT a way to view the contents of a data frame in R?
A) `print()`
B) `summary()`
C) `tail()`
D) `scan()`
Answer: D) `scan()`
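
The data frame functions covered in the last few questions can be tried on a small example. This is an illustrative sketch only; the data frame `students` and its values are invented.

```r
# Build a small data frame (hypothetical data).
students <- data.frame(
  name  = c("Asha", "Ravi", "Meera", "Kiran"),
  marks = c(82, 75, 91, 68)
)

str(students)       # structure: column names, types, sample values
summary(students)   # summary statistics for each column
head(students, 2)   # first two rows
tail(students, 2)   # last two rows
print(students)     # the whole data frame
```

By contrast, `scan()` reads raw values from a file or the console, so it is not used for viewing a data frame.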


Unit – 3 : Short Q/A


1. What is R used for?
Answer: R is used for statistical computing, data analysis, and creating data visualizations.

2. Is R a compiled or an interpreted language?


Answer: R is an interpreted language.

3. Which sector frequently uses R for data analysis?


Answer: The healthcare sector frequently uses R for data analysis.

4. What is R known for in terms of data visualization?


Answer: R is known for its powerful data visualization capabilities.

5. What are some applications of R?


Answer: R is used in statistics, data mining, machine learning, and bioinformatics.

6. What is RStudio?
Answer: RStudio is an Integrated Development Environment (IDE) for R.

7. Where can you download R?


Answer: R can be downloaded from the official website, www.r-project.org.

8. What is CRAN?
Answer: CRAN is the Comprehensive R Archive Network, used for installing R packages.

9. Is R supported on all major operating systems?


Answer: Yes, R runs on Windows, macOS, and Linux.

10. How do you check if R is installed correctly?


Answer: By running the command `print("Hello, World!")` in RStudio.

11. Which symbol is used to assign a value to a variable in R?


Answer: The symbol `<-` is used to assign values in R.

12. What is the data type of a value like `TRUE` in R?


Answer: The data type is `logical`.

13. How do you comment in R?


Answer: Comments in R start with `#`.

14. How do you create a sequence of numbers from 1 to 10 in R?


Answer: You can create it using `1:10`.

15. What does the `class()` function do in R?


Answer: It returns the class (data type) of an object, for example "numeric" or "character".

16. Which function is used to import a CSV file into R?


Answer: The `read.csv()` function is used to import CSV files.


17. Which package is commonly used to import Excel files into R?


Answer: The `readxl` package is commonly used for importing Excel files.

18. How do you read an Excel file using the `readxl` package?
Answer: You use the `read_excel()` function.

19. How do you write a data frame into a CSV file in R?


Answer: You use the `write.csv()` function.

20. How can you import a CSV file named `data.csv`?


Answer: You can import it using `read.csv("data.csv")`.

21. Which function is used to view the first few rows of a data frame?
Answer: The `head()` function is used to view the first few rows of a data frame.

22. How do you create a data frame in R?


Answer: You can create a data frame using the `data.frame()` function.

23. What function displays the structure of a data frame?


Answer: The `str()` function displays the structure of a data frame.

24. Which function gives a summary of the data in a data frame?


Answer: The `summary()` function provides a summary of the data in a data frame.

25. How do you view the last few rows of a data frame?
Answer: The `tail()` function is used to view the last few rows of a data frame.

26. How do you write a data frame to a CSV file?


Answer: You can write a data frame to a CSV file using `write.csv()`.

27. How do you view the column names of a data frame?


Answer: You can view the column names using the `colnames()` function.

28. What function would you use to print the entire data frame to the console?
Answer: The `print()` function is used to print the entire data frame.

29. How do you combine two data frames vertically in R?


Answer: You use the `rbind()` function to combine data frames vertically.

30. How can you access a specific column of a data frame?


Answer: You can access a column using the `$` operator, e.g., `df$column_name`.
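
The last few answers (`colnames()`, `rbind()` and the `$` operator) can be combined in one short example. The data frames `first` and `second` below are hypothetical and exist only for this sketch.

```r
# Two small data frames with the same columns (hypothetical data).
first  <- data.frame(name = c("Asha", "Ravi"), marks = c(82, 75))
second <- data.frame(name = c("Meera"), marks = c(91))

# Combine them vertically (rows of 'second' go below rows of 'first').
combined <- rbind(first, second)

colnames(combined)   # "name" "marks"
combined$marks       # 82 75 91
nrow(combined)       # 3
```

`rbind()` expects both data frames to have the same column names; otherwise it stops with an error.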
