Unit V - R Programming Notes

UNIT – V STUDY MATERIAL

Interfacing R with Other Languages


Interfacing R with other programming languages allows you to combine R's statistical
capabilities with the performance or specialized libraries of languages like Python, C++, or
Java. Below are the most common and useful interfaces:
1. Interfacing R with Python using reticulate
• Access powerful Python libraries (e.g., TensorFlow, pandas, NumPy, scikit-learn).
• Seamlessly integrate R and Python workflows.
Installation:
install.packages("reticulate")
Basic Example:
library(reticulate)
# Run Python code from R
py_run_string("import numpy as np; x = np.array([1, 2, 3]); x = x * 2")
py$x # Access the Python variable x from R
Calling Python Functions:
np <- import("numpy")
np$mean(c(1, 2, 3)) # Calls numpy’s mean
Running a Python script:
py_run_file("[Link]")
2. Interfacing R with C/C++ using Rcpp
• Significant performance improvements for computationally heavy tasks.
• Use C++ functions directly in R.
Installation:
install.packages("Rcpp")
Basic Example:
library(Rcpp)
cppFunction('
  int factorial(int n) {
    if (n <= 1) return 1;
    else return n * factorial(n - 1);
  }
')
factorial(5) # Output: 120
Using External C++ Files:
sourceCpp("your_cpp_code.cpp")

3. Interfacing R with Java using rJava

• Use Java libraries directly in R.
• Important for working with enterprise software systems and Java-based APIs.
Installation:
install.packages("rJava")
Basic Example:
library(rJava)
.jinit()
a <- .jnew("java/lang/String", "Hello from Java")
.jcall(a, "I", "length") # Calls the length method
Use Cases:
• Python: Deep learning, image processing, web scraping.
• C/C++: High-performance simulations, numerical methods.
• Java: Working with JVM libraries, interacting with large-scale enterprise apps.

Parallel R
R is traditionally single-threaded, but for large computations or simulations, parallel
computing can drastically improve performance. R offers several packages to support parallel
execution on multiple cores or processors.
1. Why Use Parallel R?
• Faster computation by distributing tasks across CPU cores.
• Well suited to independent loop iterations, simulations, bootstrapping, and large data processing.
2. Core Packages for Parallelism
a. parallel Package (Built-in)
Ships with every R installation, so no separate install is needed.
Example: Parallel Apply (parLapply)
library(parallel)
# Detect available cores
cores <- detectCores()
cl <- makeCluster(2) # Use 2 cores
# Parallel version of lapply
result <- parLapply(cl, 1:5, function(x) x^2)
stopCluster(cl)

print(result)
Other Functions:
• parSapply() – parallel version of sapply
• mclapply() – fork-based parallelism (Unix/macOS only)

b. foreach + doParallel: Elegant Parallel Loops


Loop-style syntax that is often easier to read than parLapply; doParallel supplies the parallel backend that foreach runs on.
Example: Using foreach with %dopar%
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)

result <- foreach(i = 1:5) %dopar% {
  i^2
}
stopCluster(cl)

print(result)
Benefits:
• Reads like an ordinary for loop.
• Supports nested loops (via %:%) and conditionals inside the loop body.

c. future and furrr: Parallel and Functional


Modern approach using futures for parallelism.
library(future)
library(furrr)

plan(multisession, workers = 2) # Run futures in 2 background R sessions


future_map(1:5, ~ .x^2)
Tips for Effective Parallel R
• Avoid using large global variables inside parallel code; each worker needs its own copy.
• Always stop the cluster (stopCluster) to free resources.
• Use detectCores() to size the cluster for the machine the code runs on.
• Use clusterExport() to copy the variables that workers need into each worker process.
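The tips above can be combined into one short sketch. The variable `scale_factor` and the helper `scaled_square()` are hypothetical names introduced only for illustration, and capping the cluster at 2 workers is an assumption to keep the example light.

```r
library(parallel)

# Hypothetical global variable and helper, for illustration only
scale_factor <- 10
scaled_square <- function(x) scale_factor * x^2

# Size the cluster from the machine, leaving one core free (capped at 2 here)
cores <- detectCores()
if (is.na(cores)) cores <- 1
n_workers <- max(1, min(2, cores - 1))

cl <- makeCluster(n_workers)
result <- tryCatch({
  # Workers start as fresh R sessions, so copy over the variable they need
  clusterExport(cl, "scale_factor", envir = environment())
  parSapply(cl, 1:5, scaled_square)
}, finally = stopCluster(cl))  # release the workers even if an error occurs

print(result)  # 10 40 90 160 250
```

Wrapping the work in tryCatch(..., finally = ...) is one way to guarantee that stopCluster() runs; inside a function, on.exit(stopCluster(cl)) serves the same purpose.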

Basic Statistics in R
R is designed for statistical computing, and it provides extensive built-in functions for
descriptive and inferential statistics.
1. Descriptive Statistics
These summarize and describe features of a dataset.
Measures of Central Tendency
data <- c(2, 4, 6, 8, 10)
mean(data) # Arithmetic mean
median(data) # Middle value
Measures of Dispersion
var(data) # Sample variance
sd(data) # Sample standard deviation
range(data) # Minimum and maximum
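As a worked check of the dispersion measures, the sketch below recomputes the sample variance and standard deviation by hand for the same five values. The names `deviations`, `manual_var`, and `manual_sd` are introduced here for illustration.

```r
data <- c(2, 4, 6, 8, 10)
n <- length(data)

# Deviations from the mean (mean is 6): -4 -2 0 2 4
deviations <- data - mean(data)

# Sample variance divides by n - 1, not n: 40 / 4 = 10
manual_var <- sum(deviations^2) / (n - 1)
manual_sd <- sqrt(manual_var)

all.equal(manual_var, var(data))  # TRUE
all.equal(manual_sd, sd(data))    # TRUE

# range() returns both endpoints; diff() turns them into a single spread value
diff(range(data))  # 8
```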
Five-Number Summary
fivenum(data) # Min, lower hinge, median, upper hinge, max
summary(data) # Quartile summary plus the mean: Min, 1st Qu., Median, Mean, 3rd Qu., Max
Frequency Tables
table(c("A", "B", "A", "C", "B", "A"))

2. Graphical Summary
Histograms and Boxplots
hist(data, main = "Histogram", col = "lightblue")
boxplot(data, main = "Boxplot")
Density Plot
plot(density(data), main = "Density Plot")

3. Inferential Statistics
These help draw conclusions from sample data.
One-Sample t-Test
Test if the mean of a sample differs from a given value.
t.test(data, mu = 5)
Two-Sample t-Test
group1 <- c(5, 6, 7, 8)
group2 <- c(7, 8, 9, 10)
t.test(group1, group2)

Paired t-Test
before <- c(10, 12, 14, 16)
after <- c(11, 14, 13, 17)
t.test(before, after, paired = TRUE)
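A paired t-test is the same as a one-sample t-test on the pairwise differences. The sketch below checks this with the before/after values and shows how to pull individual numbers out of the test result; the object names `paired` and `one_sample` are introduced for illustration.

```r
before <- c(10, 12, 14, 16)
after <- c(11, 14, 13, 17)

paired <- t.test(before, after, paired = TRUE)

# Same test expressed on the differences: is their mean different from 0?
one_sample <- t.test(before - after, mu = 0)

mean(before - after)  # -0.75, the estimated mean difference

# Both calls give the same t statistic and p-value
all.equal(unname(paired$statistic), unname(one_sample$statistic))  # TRUE
all.equal(paired$p.value, one_sample$p.value)                      # TRUE

# Individual results are stored as list elements of the returned object
paired$p.value
paired$conf.int
```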

4. Correlation and Covariance


Pearson Correlation
x <- c(1, 2, 3, 4)
y <- c(2, 4, 6, 8)

cor(x, y) # Correlation coefficient


cov(x, y) # Covariance
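Correlation is covariance rescaled by the two standard deviations, which is why it always lies between -1 and 1. A minimal check with the same vectors (the name `manual_cor` is introduced for illustration):

```r
x <- c(1, 2, 3, 4)
y <- c(2, 4, 6, 8)

# Pearson correlation = covariance / (sd of x * sd of y)
manual_cor <- cov(x, y) / (sd(x) * sd(y))

all.equal(manual_cor, cor(x, y))  # TRUE
cor(x, y)                         # 1, because y is exactly 2 * x

# Covariance depends on the units of x and y; correlation does not
cov(x, 10 * y) / cov(x, y)        # 10
all.equal(cor(x, 10 * y), cor(x, y))  # TRUE
```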

5. Chi-Square Test
Tests whether two categorical variables are independent.
# Create a contingency table
obs <- matrix(c(20, 30, 50, 100), nrow = 2)
chisq.test(obs)
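The test compares the observed counts with the counts expected if rows and columns were independent. The sketch below rebuilds those expected counts and the uncorrected statistic by hand; `expected_manual` and `stat_manual` are names introduced for illustration. Note that for 2 x 2 tables chisq.test() applies Yates' continuity correction by default, so `correct = FALSE` is needed to match the textbook formula.

```r
obs <- matrix(c(20, 30, 50, 100), nrow = 2)

# Expected count for each cell = row total * column total / grand total
expected_manual <- outer(rowSums(obs), colSums(obs)) / sum(obs)
expected_manual
#      [,1] [,2]
# [1,] 17.5 52.5
# [2,] 32.5 97.5

# Pearson statistic: sum over cells of (observed - expected)^2 / expected
stat_manual <- sum((obs - expected_manual)^2 / expected_manual)

test_plain <- chisq.test(obs, correct = FALSE)
all.equal(unname(test_plain$statistic), stat_manual)                 # TRUE
all.equal(test_plain$expected, expected_manual, check.attributes = FALSE)  # TRUE
```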

6. ANOVA (Analysis of Variance)


To compare means across multiple groups.
data(iris)
anova_result <- aov([Link] ~ Species, data = iris)
summary(anova_result)
Linear Models (LM) & Generalized Linear Models (GLM) in R
1. Linear Models (LM)
Linear models describe the relationship between a continuous response variable and one or
more predictor variables using a linear equation.
Model Form:
Y = β0 + β1X1 + β2X2 + ⋯ + ϵ

Basic Linear Model in R


data(mtcars)
model <- lm(mpg ~ wt + hp, data = mtcars)
summary(model)
Interpretation:
• Estimate: Fitted coefficients (intercept and slopes)
• Pr(>|t|): p-values for testing if coefficients are significantly different from zero
• R-squared: Proportion of the variance in the response explained by the model
• Residuals: Differences between actual and predicted values
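The quantities listed above can also be extracted programmatically instead of being read off the printed summary. A minimal sketch using the same model; the data frame `new_car` is a hypothetical observation made up for illustration.

```r
data(mtcars)
model <- lm(mpg ~ wt + hp, data = mtcars)

coef(model)                      # intercept and the two slopes
summary(model)$coefficients      # estimates, std. errors, t values, p-values
r_squared <- summary(model)$r.squared

# Residuals are observed values minus fitted values
all.equal(unname(residuals(model)),
          unname(mtcars$mpg - fitted(model)))  # TRUE

# Predict mpg for a hypothetical car (wt in 1000 lbs, hp in horsepower)
new_car <- data.frame(wt = 3, hp = 110)
predict(model, newdata = new_car, interval = "confidence")
```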
Diagnostic Plots
par(mfrow = c(2, 2))
plot(model)
These help check assumptions:
• Linearity
• Normality of residuals
• Homoscedasticity (equal variance)
• Independence

Model with Interaction Term


lm(mpg ~ wt * hp, data = mtcars)
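In R's formula syntax, `wt * hp` is shorthand for both main effects plus their interaction, `wt + hp + wt:hp`. A short check (the names `short_form` and `long_form` are introduced for illustration):

```r
data(mtcars)

short_form <- lm(mpg ~ wt * hp, data = mtcars)
long_form <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)

names(coef(short_form))  # "(Intercept)" "wt" "hp" "wt:hp"

# Both formulas describe the same model
all.equal(coef(short_form), coef(long_form))  # TRUE
```

With the interaction term, the effect of wt on mpg is no longer a single slope; it changes with the value of hp.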

2. Generalized Linear Models (GLM)


GLMs extend LMs to handle non-normal response distributions using a link function.
Model Structure:
g(E(Y))=β0+β1X1+…
Where g() is the link function, and E(Y) is the expected value.
GLM Syntax in R
glm(formula, family = <distribution>, data = ...)
Common Families and Link Functions:

Family | Link Function | Use Case
gaussian | identity | Linear regression
binomial | logit | Logistic regression (binary)
poisson | log | Count data
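The first row of the table can be checked directly: a GLM with the gaussian family and identity link reproduces the ordinary linear model. A minimal sketch on mtcars (the names `lm_fit` and `glm_fit` are introduced for illustration):

```r
data(mtcars)

lm_fit <- lm(mpg ~ wt + hp, data = mtcars)
glm_fit <- glm(mpg ~ wt + hp, data = mtcars, family = gaussian(link = "identity"))

# Same coefficient estimates from both fitting routines
all.equal(coef(lm_fit), coef(glm_fit))  # TRUE

# Each family object carries its default link function
binomial()$link  # "logit"
poisson()$link   # "log"
```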

a. Logistic Regression (Binomial Family)


glm_model <- glm(vs ~ mpg + wt, data = mtcars, family = binomial)
summary(glm_model)
• Used when the response variable is binary (0/1).
• vs: engine shape (0 = V-shaped, 1 = straight).

b. Poisson Regression
# Example: modeling counts (hypothetical)
counts <- rpois(100, lambda = 5)
group <- gl(2, 50)

glm(counts ~ group, family = poisson)

Model Evaluation
For Logistic Regression
# Predicted probabilities
predicted_probs <- predict(glm_model, type = "response")

# Confusion matrix
predicted_class <- ifelse(predicted_probs > 0.5, 1, 0)
table(predicted_class, mtcars$vs)
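A confusion matrix is often summarized as a single accuracy figure, and logistic coefficients are often reported as odds ratios. The sketch below adds both; the 0.5 cutoff is the one used above, and `accuracy` and `odds_ratios` are names introduced for illustration. Because the model is evaluated on the data it was fitted to, this accuracy is likely optimistic.

```r
data(mtcars)
glm_model <- glm(vs ~ mpg + wt, data = mtcars, family = binomial)

predicted_probs <- predict(glm_model, type = "response")
predicted_class <- ifelse(predicted_probs > 0.5, 1, 0)

# Share of cars whose predicted class matches the observed engine shape
accuracy <- mean(predicted_class == mtcars$vs)
accuracy

# Coefficients are on the log-odds scale; exponentiate to get odds ratios
odds_ratios <- exp(coef(glm_model))
odds_ratios
```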
When to Use LM vs. GLM?

Situation | Use LM | Use GLM
Response is continuous | ✅ | ✅ (gaussian)
Response is binary | ❌ | ✅ (binomial/logit)
Response is count data | ❌ | ✅ (poisson/log)
Non-constant variance or skewed | ❌ | ✅ (use link function)

Non-linear Models in R
Non-linear models are used when the relationship between variables can't be explained well by
a straight line.
Model Form:
y = f(x, θ) + ϵ
Using nls() for Non-Linear Least Squares
# Simulated exponential growth data
x <- 1:10
y <- 5 * exp(0.4 * x) + rnorm(10, sd = 2)

# Fit non-linear model: y = a * exp(b * x)


model <- nls(y ~ a * exp(b * x), start = list(a = 1, b = 0.1))
summary(model)
Visualizing the Fit
plot(x, y, main = "Non-linear Fit")
lines(x, predict(model), col = "red", lwd = 2)
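nls() can fail to converge when the starting values are far from the solution. For an exponential model, one common way to get reasonable starts is to fit a straight line to log(y) first. The sketch below shows this under two assumptions: a fixed seed (42, chosen arbitrarily) for reproducibility, and strictly positive y values so that the logarithm is defined. The names `log_fit` and `start_vals` are introduced for illustration.

```r
set.seed(42)  # assumed seed, only to make the sketch reproducible
x <- 1:10
y <- 5 * exp(0.4 * x) + rnorm(10, sd = 2)

# log(y) = log(a) + b * x is linear in x, so lm() gives rough estimates
# (this shortcut requires every y to be positive)
log_fit <- lm(log(y) ~ x)
start_vals <- list(a = exp(coef(log_fit)[[1]]), b = coef(log_fit)[[2]])

model <- nls(y ~ a * exp(b * x), start = start_vals)
coef(model)  # estimates should land near the true a = 5 and b = 0.4

# Predict at a new x value
predict(model, newdata = data.frame(x = 11))
```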

Time Series and Auto-Correlation in R


Time series analysis involves data points ordered in time, often with the goal of forecasting
future values, identifying trends, or detecting patterns like seasonality and cyclicity.

1. What is a Time Series?


A time series is a sequence of observations taken at successive equally spaced points in time.
Examples: stock prices, temperature data, sales over months.
2. Creating Time Series in R
Use ts() to create a time series object.
data <- c(112, 118, 132, 129, 121, 135, 148, 148, 136, 119)
ts_data <- ts(data, start = c(2020, 1), frequency = 12)
plot(ts_data, main = "Monthly Time Series", ylab = "Value")
• start: starting year and period
• frequency: 12 (monthly), 4 (quarterly), 1 (yearly)
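Once a ts object exists, its time index can be inspected and used for subsetting. A short sketch with the same ten monthly values (the name `spring` is introduced for illustration):

```r
data <- c(112, 118, 132, 129, 121, 135, 148, 148, 136, 119)
ts_data <- ts(data, start = c(2020, 1), frequency = 12)

start(ts_data)      # 2020 1  (January 2020)
end(ts_data)        # 2020 10 (October 2020, ten months later)
frequency(ts_data)  # 12
cycle(ts_data)      # position of each observation within the year

# window() extracts a sub-period by date rather than by index
spring <- window(ts_data, start = c(2020, 4), end = c(2020, 6))
spring  # 129 121 135
```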

3. Time Series Components


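A time series is usually described by four components:
• Trend: Long-term progression
• Seasonality: Regular fluctuations with a fixed period
• Cyclic: Rises and falls with no fixed period, usually over longer horizons
• Residual: Random noise
Decomposition Example
# decompose() needs at least two full periods (24 monthly values). The 10-point
# ts_data above is too short, so this example uses the built-in AirPassengers series.
decomposed <- decompose(AirPassengers)
plot(decomposed)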

4. Auto-Correlation
Auto-correlation is how current values are related to past values (lags). It's a key concept for
building models like ARIMA.
Auto-Correlation Function (ACF)
acf(ts_data, main = "ACF Plot")
Partial Auto-Correlation Function (PACF)
pacf(ts_data, main = "PACF Plot")
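To make the definition concrete, the sketch below computes the lag-1 autocorrelation by hand and compares it with the value returned by acf(). The series is the ten-value example from above; `manual_r1` and `acf_r1` are names introduced for illustration.

```r
x <- c(112, 118, 132, 129, 121, 135, 148, 148, 136, 119)
n <- length(x)
m <- mean(x)

# Lag-1 autocorrelation: co-movement of x[t] and x[t + 1] around the overall
# mean, divided by the total variation of the series
manual_r1 <- sum((x[-n] - m) * (x[-1] - m)) / sum((x - m)^2)

# acf() stores lag 0 first, so lag 1 is the second element
acf_r1 <- acf(x, plot = FALSE)$acf[2]

all.equal(manual_r1, acf_r1)  # TRUE
```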

5. ARIMA Model (AutoRegressive Integrated Moving Average)


A widely used model family for univariate forecasting.
Automatic ARIMA
library(forecast)
fit <- auto.arima(ts_data)
summary(fit)
forecasted <- forecast(fit, h = 12)
plot(forecasted)
• auto.arima() searches for the (p,d,q) orders that minimize an information criterion (AICc by default)
• h = 12 forecasts 12 future time points
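For comparison, an ARIMA model can also be fitted with base R's arima(), where the orders are chosen by hand instead of automatically. The sketch below is not the auto.arima() workflow above: it assumes the built-in AirPassengers series, a log transform, and the hand-picked seasonal order (0,1,1)(0,1,1) with period 12, all chosen here for illustration.

```r
# Log transform to stabilize the growing seasonal swings (an assumption
# made for this sketch)
log_air <- log(AirPassengers)

# Orders picked by hand; auto.arima() would search for them instead
fit <- arima(log_air,
             order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))

# Forecast 12 months ahead and return to the original scale
pred <- predict(fit, n.ahead = 12)
forecast_values <- exp(pred$pred)
forecast_values
```

The d = 1 and D = 1 terms in the orders are the differencing steps, which is how ARIMA copes with trend and seasonality in the raw series.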
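6. Stationarity Check
ARIMA removes trends by differencing (the d term), so the series should be stationary after differencing.
Augmented Dickey-Fuller Test (ADF Test)
library(tseries)
adf.test(ts_data)
• p < 0.05: reject the unit-root null hypothesis, which is evidence that the series is stationary
• If not, you may need to difference the series:
diffed_data <- diff(ts_data)
plot(diffed_data)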

7. Seasonal Decomposition Using Loess (STL)


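STL uses loess smoothing, can follow seasonality that changes over time, and offers a robust option for outliers.
# stl() needs at least two full periods, so the 10-point ts_data above is too
# short; this example uses the built-in AirPassengers series.
fit_stl <- stl(AirPassengers, s.window = "periodic")
plot(fit_stl)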
✅ Applications
• Sales forecasting
• Temperature prediction
• Website traffic analysis
• Financial time series
Clustering in R – Detailed Explanation with Examples
Clustering is an unsupervised learning technique that groups similar data points into clusters.
R offers powerful tools to perform and visualize clustering for both numerical and categorical
data.
1. Types of Clustering Methods

Method | Description | Suitable For
K-Means | Partitions data into K clusters | Numerical features
Hierarchical | Creates a tree of clusters (dendrogram) | Numerical features
DBSCAN | Density-based clustering | Arbitrary shapes, noisy data
Model-Based | Assumes data comes from a mixture of probability distributions | Probabilistic cluster assignment


2. K-Means Clustering
a. Load Data and Apply Clustering
data(iris)
set.seed(123)
kmodel <- kmeans(iris[, 1:4], centers = 3)
print(kmodel)
b. Compare with True Species Labels
table(kmodel$cluster, iris$Species)
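The example fixes K at 3 because iris has three species, but K is usually unknown. One common heuristic (not part of the notes above) is the elbow method: fit k-means for several values of K and look for the point where the total within-cluster sum of squares stops dropping quickly. The range 1 to 6 and `nstart = 10` are assumptions made for this sketch.

```r
data(iris)
set.seed(123)
features <- iris[, 1:4]

# Total within-cluster sum of squares for K = 1, ..., 6
# nstart = 10 reruns k-means from 10 random starts and keeps the best result
wss <- sapply(1:6, function(k) {
  kmeans(features, centers = k, nstart = 10)$tot.withinss
})

plot(1:6, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares",
     main = "Elbow Plot for Iris Data")
```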
c. Visualize Clusters
library(ggplot2)
iris$Cluster <- as.factor(kmodel$cluster)
ggplot(iris, aes(x = [Link], y = [Link], color = Cluster)) +
geom_point(size = 3) +
labs(title = "K-Means Clustering on Iris Data")

3. Hierarchical Clustering
a. Compute Distance Matrix and Apply Clustering
d <- dist(iris[, 1:4]) # Euclidean distance
hc <- hclust(d, method = "complete")

plot(hc, labels = iris$Species, main = "Hierarchical Clustering Dendrogram")


b. Cut into Clusters
cutree(hc, k = 3)
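cutree() returns one cluster label per observation, so its output can be cross-tabulated against the species in the same way as the k-means result. The name `hc_clusters` is introduced for illustration.

```r
data(iris)
d <- dist(iris[, 1:4])               # Euclidean distance
hc <- hclust(d, method = "complete")

# One integer label (1, 2, or 3) for each of the 150 flowers
hc_clusters <- cutree(hc, k = 3)

# Rows are clusters, columns are the true species
table(Cluster = hc_clusters, Species = iris$Species)
```

Instead of fixing the number of clusters with `k`, cutree() also accepts a cut height through its `h` argument.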

4. DBSCAN – Density-Based Clustering


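library(dbscan)

# Using scaled Iris data
data <- scale(iris[, 1:4])
db <- dbscan(data, eps = 0.5, minPts = 5)

# Scatterplot matrix colored by cluster; cluster 0 (noise points) is drawn in black
pairs(data, col = db$cluster + 1L)
• eps: neighborhood radius
• minPts: minimum number of points (including the point itself) within eps for a point to count as a core point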
5. Clustering Evaluation
Silhouette Score (Higher = Better)
library(cluster)

sil <- silhouette(kmodel$cluster, dist(iris[, 1:4]))


plot(sil)
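The silhouette plot is usually summarized by the average silhouette width, overall and per cluster. A minimal sketch that recomputes the k-means fit so it runs on its own; `avg_width` and `per_cluster` are names introduced for illustration.

```r
library(cluster)

data(iris)
set.seed(123)
kmodel <- kmeans(iris[, 1:4], centers = 3)
sil <- silhouette(kmodel$cluster, dist(iris[, 1:4]))

# Each row holds a point's cluster, its nearest other cluster, and its width
head(sil)

# Overall average: values near 1 mean compact, well-separated clusters
avg_width <- mean(sil[, "sil_width"])
avg_width

# Average per cluster, to spot a weak cluster
per_cluster <- tapply(sil[, "sil_width"], sil[, "cluster"], mean)
per_cluster
```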

✅ Use Cases of Clustering

• Customer segmentation (marketing)
• Document grouping (text mining)
• Image compression (pixel clustering)
• Anomaly detection (points that fit no cluster well are flagged as outliers)
