Unit V - R Programming Notes


UNIT – V STUDY MATERIAL

Interfacing R with Other Languages


Interfacing R with other programming languages allows you to combine R's statistical
capabilities with the performance or specialized libraries of languages like Python, C++, or
Java. Below are the most common and useful interfaces:
1. Interfacing R with Python using reticulate
• Access powerful Python libraries (e.g., TensorFlow, pandas, NumPy, scikit-learn).
• Seamlessly integrate R and Python workflows.
Installation:
install.packages("reticulate")
Basic Example:
library(reticulate)
# Run Python code from R
py_run_string("import numpy as np; x = np.array([1, 2, 3]); x = x * 2")
py$x # Access the Python variable x in R
Calling Python Functions:
np <- import("numpy")
np$mean(c(1, 2, 3)) # Calls numpy’s mean
Running a Python script:
py_run_file("script.py")
2. Interfacing R with C/C++ using Rcpp
• Significant performance improvements for computationally heavy tasks.
• Use C++ functions directly in R.
Installation:
install.packages("Rcpp")
Basic Example:
library(Rcpp)
# Note: this definition masks base R's factorial(), and int overflows for n > 12
cppFunction('
  int factorial(int n) {
    if (n <= 1) return 1;
    else return n * factorial(n - 1);
  }
')
factorial(5) # Output: 120
Using External C++ Files:
sourceCpp("your_cpp_code.cpp")

3. Interfacing R with Java using rJava


• Use Java libraries directly in R.
• Important for working with enterprise software systems and Java-based APIs.
Installation:
install.packages("rJava")
Basic Example:
library(rJava)
.jinit()
a <- .jnew("java/lang/String", "Hello from Java")
.jcall(a, "I", "length") # Calls the length method
Use Cases:
• Python: Deep learning, image processing, web scraping.
• C/C++: High-performance simulations, numerical methods.
• Java: Working with JVM libraries, interacting with large-scale enterprise apps.

Parallel R
R is traditionally single-threaded, but for large computations or simulations, parallel
computing can drastically improve performance. R offers several packages to support parallel
execution on multiple cores or processors.
1. Why Use Parallel R?
• Faster computation by distributing tasks across CPU cores.
• Well suited to loops with independent iterations, simulations, bootstrapping, and large-data processing.
2. Core Packages for Parallelism
a. parallel Package (Built-in)
The parallel package ships with every R installation, so no install.packages() call is needed.
Example: Parallel Apply (parLapply)
library(parallel)
# Detect available cores
cores <- detectCores()
cl <- makeCluster(2) # Use 2 cores
# Parallel version of lapply
result <- parLapply(cl, 1:5, function(x) x^2)
stopCluster(cl)

print(result)
Other Functions:
• parSapply() – parallel version of sapply
• mclapply() – fork-based parallelism (Unix/macOS; on Windows it only runs serially, with mc.cores = 1)

b. foreach + doParallel: Elegant Parallel Loops


Often easier to read than parLapply() when the work is naturally written as a loop.
Example: Using foreach with %dopar%
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)

result <- foreach(i = 1:5) %dopar% {
  i^2
}
stopCluster(cl)

print(result)
Benefits:
• Loop-style syntax that reads like an ordinary for loop.
• Supports nested loops and conditionals inside the loop body.

c. future and furrr: Parallel and Functional


Modern approach using futures for parallelism.
library(future)
library(furrr)

plan(multisession, workers = 2) # Run futures in 2 background R sessions
future_map(1:5, ~ .x^2)
Tips for Effective Parallel R
• Avoid referencing large global variables inside parallel code; each worker receives its own copy.
• Always stop the cluster (stopCluster) to free resources.
• Use detectCores() to dynamically scale across different machines.
• Use clusterExport() to copy variables from the main session to the workers.
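The tips above can be put together in one short sketch using only the built-in parallel package. The names scale_factor, n_cores, and scaled are illustrative and do not come from the notes; the cap of two workers is an arbitrary choice for the example.

```r
library(parallel)

# Illustrative value defined in the main session
scale_factor <- 10

# Use at most 2 workers and leave one core free; fall back to 1 worker on single-core machines
n_cores <- max(1L, min(2L, detectCores() - 1L, na.rm = TRUE))
cl <- makeCluster(n_cores)

# Workers start with empty workspaces, so copy scale_factor to each of them
clusterExport(cl, "scale_factor", envir = environment())

# finally = ... stops the cluster even if the computation fails
scaled <- tryCatch(
  parSapply(cl, 1:5, function(x) x * scale_factor),
  finally = stopCluster(cl)
)
print(scaled)
```

Wrapping the computation in tryCatch() with a finally clause is one way to make sure the worker processes are released when an error occurs.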

Basic Statistics in R
R is designed for statistical computing, and it provides extensive built-in functions for
descriptive and inferential statistics.
1. Descriptive Statistics
These summarize and describe features of a dataset.
Measures of Central Tendency
data <- c(2, 4, 6, 8, 10)
mean(data) # Arithmetic mean
median(data) # Middle value
Measures of Dispersion
var(data) # Variance
sd(data) # Standard deviation
range(data) # Minimum and maximum
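As a worked check, the sketch below recomputes the variance and standard deviation by hand for the same vector. It shows that var() and sd() return the sample versions, which divide by n - 1 and not by n. The helper names deviations, manual_var, and manual_sd are illustrative.

```r
data <- c(2, 4, 6, 8, 10)
n <- length(data)

# Sample variance: squared deviations from the mean, divided by n - 1
deviations <- data - mean(data)            # -4 -2 0 2 4
manual_var <- sum(deviations^2) / (n - 1)  # 40 / 4 = 10
manual_sd  <- sqrt(manual_var)

# Both comparisons should be TRUE
all.equal(manual_var, var(data))
all.equal(manual_sd, sd(data))
```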
Five-Number Summary
summary(data) # Min, 1st Qu., Median, Mean, 3rd Qu., Max (the five-number summary plus the mean)
Frequency Tables
table(c("A", "B", "A", "C", "B", "A"))

2. Graphical Summary
Histograms and Boxplots
hist(data, main = "Histogram", col = "lightblue")
boxplot(data, main = "Boxplot")
Density Plot
plot(density(data), main = "Density Plot")

3. Inferential Statistics
These help draw conclusions from sample data.
One-Sample t-Test
Test if the mean of a sample differs from a given value.
t.test(data, mu = 5)
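To make the test statistic less of a black box, the following sketch recomputes the one-sample t statistic and two-sided p-value by hand and compares them with t.test(). The names mu0, t_manual, p_manual, and res are illustrative.

```r
data <- c(2, 4, 6, 8, 10)
mu0 <- 5  # hypothesised mean, matching mu = 5 above

n <- length(data)

# t = (sample mean - hypothesised mean) / standard error of the mean
t_manual <- (mean(data) - mu0) / (sd(data) / sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual <- 2 * pt(-abs(t_manual), df = n - 1)

res <- t.test(data, mu = mu0)
all.equal(t_manual, unname(res$statistic))  # should be TRUE
all.equal(p_manual, res$p.value)            # should be TRUE
```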
Two-Sample t-Test
group1 <- c(5, 6, 7, 8)
group2 <- c(7, 8, 9, 10)
t.test(group1, group2) # Welch's t-test by default (var.equal = FALSE)

Paired t-Test
before <- c(10, 12, 14, 16)
after <- c(11, 14, 13, 17)
t.test(before, after, paired = TRUE)

4. Correlation and Covariance


Pearson Correlation
x <- c(1, 2, 3, 4)
y <- c(2, 4, 6, 8)
cor(x, y) # Correlation coefficient
cov(x, y) # Covariance
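The relationship between the two quantities can be checked directly: the Pearson correlation is the covariance rescaled by both standard deviations. A minimal sketch on the same vectors, with illustrative names cov_manual and cor_manual:

```r
x <- c(1, 2, 3, 4)
y <- c(2, 4, 6, 8)

# Sample covariance: sum of products of deviations, divided by n - 1
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)

# Pearson correlation: covariance divided by the product of the standard deviations
cor_manual <- cov_manual / (sd(x) * sd(y))

all.equal(cov_manual, cov(x, y))  # should be TRUE
all.equal(cor_manual, cor(x, y))  # should be TRUE; y = 2 * x, so the correlation is 1
```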

5. Chi-Square Test
Used for categorical data to test independence.
# Create a contingency table
obs <- matrix(c(20, 30, 50, 100), nrow = 2)
chisq.test(obs)
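The sketch below shows where the statistic comes from by computing the expected counts under independence and the Pearson statistic by hand. For a 2 x 2 table chisq.test() applies Yates' continuity correction by default, so the comparison uses correct = FALSE. The names expected, chi_manual, and res are illustrative.

```r
obs <- matrix(c(20, 30, 50, 100), nrow = 2)

# Expected count under independence: (row total * column total) / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson statistic without the continuity correction
chi_manual <- sum((obs - expected)^2 / expected)

res <- chisq.test(obs, correct = FALSE)
all.equal(expected, res$expected, check.attributes = FALSE)  # should be TRUE
all.equal(chi_manual, unname(res$statistic))                 # should be TRUE
```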

6. ANOVA (Analysis of Variance)


ANOVA compares means across multiple groups.
data(iris)
anova_result <- aov(Sepal.Length ~ Species, data = iris)
summary(anova_result)
Linear Models (LM) & Generalized Linear Models (GLM) in R
1. Linear Models (LM)
Linear models describe the relationship between a continuous response variable and one or
more predictor variables using a linear equation.
Model Form:
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \epsilon$

Basic Linear Model in R


data(mtcars)
model <- lm(mpg ~ wt + hp, data = mtcars)
summary(model)
Interpretation:
• Estimate: Fitted coefficients (the intercept and one slope per predictor)
• Pr(>|t|): p-values for testing whether each coefficient differs significantly from zero
• R-squared: Proportion of variance in the response explained by the model
• Residuals: Differences between actual and predicted values
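The quantities listed above can also be pulled out of the fitted object programmatically. The sketch below refits the same model and extracts them; the data frame new_car (3,000 lb, 110 hp) is a hypothetical example, not a row from the notes.

```r
data(mtcars)
model <- lm(mpg ~ wt + hp, data = mtcars)

coefs <- coef(model)               # intercept plus one slope per predictor
r2    <- summary(model)$r.squared  # proportion of variance explained

# Prediction for a hypothetical car (wt is measured in 1000 lb)
new_car <- data.frame(wt = 3.0, hp = 110)
pred <- predict(model, newdata = new_car)

# The same prediction written out as the linear equation
pred_manual <- coefs["(Intercept)"] + coefs["wt"] * 3.0 + coefs["hp"] * 110

# Residuals are observed values minus fitted values
all.equal(residuals(model), mtcars$mpg - fitted(model), check.attributes = FALSE)
all.equal(unname(pred), unname(pred_manual))  # should be TRUE
```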
Diagnostic Plots
par(mfrow = c(2, 2))
plot(model)
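These help check the model assumptions:
• Linearity (residuals vs fitted plot)
• Normality of residuals (Q-Q plot)
• Homoscedasticity, i.e. equal variance (scale-location plot)
• Influential observations (residuals vs leverage plot); independence of errors is not shown by these plots and is usually judged from how the data were collected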

Model with Interaction Term


lm(mpg ~ wt * hp, data = mtcars)

2. Generalized Linear Models (GLM)


GLMs extend LMs to handle non-normal response distributions using a link function.
Model Structure:
$g(E(Y)) = \beta_0 + \beta_1 X_1 + \cdots$
where $g(\cdot)$ is the link function and $E(Y)$ is the expected value of the response.
GLM Syntax in R
glm(formula, family = <distribution>, data = ...)
Common Families and Link Functions:

Family   | Link Function | Use Case
gaussian | identity      | Linear regression
binomial | logit         | Logistic regression (binary)
poisson  | log           | Count data

a. Logistic Regression (Binomial Family)


glm_model <- glm(vs ~ mpg + wt, data = mtcars, family = binomial)
summary(glm_model)
• Used when the response variable is binary (0/1).
• vs: engine shape (0 = V-shaped, 1 = straight).
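To connect this model back to the link function g(), the sketch below checks that the predicted probabilities are the inverse logit of the linear predictor. The names eta and probs_manual are illustrative.

```r
glm_model <- glm(vs ~ mpg + wt, data = mtcars, family = binomial)

# Linear predictor on the link (log-odds) scale
eta <- predict(glm_model, type = "link")

# Inverse logit maps log-odds back to probabilities; plogis(eta) is 1 / (1 + exp(-eta))
probs_manual <- plogis(eta)

all.equal(probs_manual, predict(glm_model, type = "response"))  # should be TRUE

# Exponentiated coefficients are odds ratios per one-unit increase in each predictor
exp(coef(glm_model))
```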

b. Poisson Regression
# Example: modeling counts (hypothetical)
counts <- rpois(100, lambda = 5)
group <- gl(2, 50)

glm(counts ~ group, family = poisson)
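Because the Poisson family uses a log link, the coefficients are easier to read after exponentiating. The sketch below repeats the simulation with an arbitrary seed (an addition for reproducibility) and shows that the exponentiated intercept is the mean count of group 1, and the exponentiated group coefficient is the ratio of the two group means. The names pois_model, rates, mean_g1, and mean_g2 are illustrative.

```r
set.seed(1)  # arbitrary seed so the simulated counts are reproducible
counts <- rpois(100, lambda = 5)
group <- gl(2, 50)

pois_model <- glm(counts ~ group, family = poisson)

# exp() turns the log-scale coefficients into a baseline rate and a rate ratio
rates <- exp(coef(pois_model))
mean_g1 <- mean(counts[group == "1"])
mean_g2 <- mean(counts[group == "2"])

all.equal(unname(rates[1]), mean_g1, tolerance = 1e-5)            # baseline rate = group 1 mean
all.equal(unname(rates[2]), mean_g2 / mean_g1, tolerance = 1e-5)  # rate ratio = group 2 / group 1
```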

Model Evaluation
For Logistic Regression
# Predicted probabilities
predicted_probs <- predict(glm_model, type = "response")

# Confusion matrix
predicted_class <- ifelse(predicted_probs > 0.5, 1, 0)
table(predicted_class, mtcars$vs)
When to Use LM vs. GLM?

Situation                       | Use LM | Use GLM
Response is continuous          | ✅     | ✅ (gaussian)
Response is binary              | ❌     | ✅ (binomial/logit)
Response is count data          | ❌     | ✅ (poisson/log)
Non-constant variance or skewed | ❌     | ✅ (choose a suitable family and link, e.g. Gamma with log)
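The first row of the table can be verified directly: a GLM with the gaussian family and its default identity link reproduces the ordinary linear model. A minimal sketch, with illustrative names lm_fit and glm_fit:

```r
lm_fit  <- lm(mpg ~ wt + hp, data = mtcars)
glm_fit <- glm(mpg ~ wt + hp, data = mtcars, family = gaussian)

# The two fits should give the same coefficients
all.equal(coef(lm_fit), coef(glm_fit))  # should be TRUE
```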

Non-linear Models in R
Non-linear models are used when the relationship between variables can't be explained well by
a straight line.
Model Form:
$y = f(x, \theta) + \epsilon$
Using nls() for Non-Linear Least Squares
# Simulated exponential growth data
x <- 1:10
y <- 5 * exp(0.4 * x) + rnorm(10, sd = 2)

# Fit non-linear model: y = a * exp(b * x)
model <- nls(y ~ a * exp(b * x), start = list(a = 1, b = 0.1))
summary(model)
Visualizing the Fit
plot(x, y, main = "Non-linear Fit")
lines(x, predict(model), col = "red", lwd = 2)
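nls() is sensitive to its starting values. For an exponential model, one common way to obtain reasonable starts is to fit a straight line to log(y) first. The sketch below assumes strictly positive y, so it uses a smaller noise level than the example above; the seed and the names log_fit and start_vals are illustrative.

```r
set.seed(42)  # arbitrary seed for reproducibility
x <- 1:10
y <- 5 * exp(0.4 * x) + rnorm(10, sd = 0.5)  # smaller noise so every y stays positive

# Taking logs linearises the model: log(y) = log(a) + b * x
log_fit <- lm(log(y) ~ x)
start_vals <- list(a = exp(coef(log_fit)[[1]]), b = coef(log_fit)[[2]])

model <- nls(y ~ a * exp(b * x), start = start_vals)
coef(model)  # estimates should land near the true values a = 5 and b = 0.4
```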

Time Series and Auto-Correlation in R


Time series analysis involves data points ordered in time, often with the goal of forecasting
future values, identifying trends, or detecting patterns like seasonality and cyclicity.

1. What is a Time Series?


A time series is a sequence of observations taken at successive equally spaced points in time.
Examples: stock prices, temperature data, sales over months.
2. Creating Time Series in R
Use ts() to create a time series object.
data <- c(112, 118, 132, 129, 121, 135, 148, 148, 136, 119)
ts_data <- ts(data, start = c(2020, 1), frequency = 12)
plot(ts_data, main = "Monthly Time Series", ylab = "Value")
• start: starting year and period
• frequency: 12 (monthly), 4 (quarterly), 1 (yearly)
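Once created, a ts object carries its own time index. The sketch below inspects that index and extracts a sub-period with window(); the name q2 is illustrative.

```r
data <- c(112, 118, 132, 129, 121, 135, 148, 148, 136, 119)
ts_data <- ts(data, start = c(2020, 1), frequency = 12)

start(ts_data)      # 2020 1  (January 2020)
end(ts_data)        # 2020 10 (October 2020, since there are 10 observations)
frequency(ts_data)  # 12

# Extract April to June 2020
q2 <- window(ts_data, start = c(2020, 4), end = c(2020, 6))
q2                  # values 129, 121, 135
```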

3. Time Series Components


A time series is usually described in terms of four components:
• Trend: Long-term progression
• Seasonality: Regular fluctuations with a fixed period
• Cyclic: Rises and falls without a fixed period (e.g., business cycles)
• Residual: Random noise
Decomposition Example
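# decompose() needs at least two full periods, and ts_data holds only 10 monthly values,
# so this example uses the built-in monthly AirPassengers series instead
decomposed <- decompose(AirPassengers)
plot(decomposed)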

4. Auto-Correlation
Auto-correlation is how current values are related to past values (lags). It's a key concept for
building models like ARIMA.
Auto-Correlation Function (ACF)
acf(ts_data, main = "ACF Plot")
Partial Auto-Correlation Function (PACF)
pacf(ts_data, main = "PACF Plot")
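To show what the ACF plot is measuring, the sketch below computes the lag-1 autocorrelation by hand and compares it with the value stored by acf(). The names centered, r1_manual, and r1_acf are illustrative.

```r
data <- c(112, 118, 132, 129, 121, 135, 148, 148, 136, 119)
ts_data <- ts(data, start = c(2020, 1), frequency = 12)

n <- length(ts_data)
centered <- as.numeric(ts_data) - mean(ts_data)

# Lag-1 autocorrelation: cross-products of neighbouring values over the total sum of squares
r1_manual <- sum(centered[-1] * centered[-n]) / sum(centered^2)

# acf() stores lag 0 first, so lag 1 is the second element
r1_acf <- acf(ts_data, plot = FALSE)$acf[2]

all.equal(r1_manual, r1_acf)  # should be TRUE
```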

5. ARIMA Model (AutoRegressive Integrated Moving Average)


A widely used model family for univariate forecasting.
Automatic ARIMA
library(forecast)
fit <- auto.arima(ts_data)
summary(fit)
forecasted <- forecast(fit, h = 12)
plot(forecasted)
• auto.arima() searches over candidate (p,d,q) orders and keeps the model with the lowest information criterion (AICc by default)
• h = 12 forecasts 12 future time points
6. Stationarity Check
The ARMA part of an ARIMA model assumes a stationary series; a non-stationary series is differenced first (the d in (p,d,q)).
Augmented Dickey-Fuller Test (ADF Test)
library(tseries)
adf.test(ts_data)
• p < 0.05: reject the null hypothesis of a unit root, which suggests the series is stationary
• If not, you may need to difference the series:
diffed_data <- diff(ts_data)
plot(diffed_data)

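7. Seasonal and Trend Decomposition Using Loess (STL)
Unlike decompose(), STL can let the seasonal pattern change over time (numeric s.window) and can down-weight outliers (robust = TRUE).
# stl() needs at least two full periods, and ts_data holds only 10 monthly values,
# so this example uses the built-in monthly AirPassengers series instead
fit_stl <- stl(AirPassengers, s.window = "periodic") # "periodic" keeps the seasonal pattern fixed
plot(fit_stl)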
✅ Applications
• Sales forecasting
• Temperature prediction
• Website traffic analysis
• Financial time series
Clustering in R – Detailed Explanation with Examples
Clustering is an unsupervised learning technique that groups similar data points into clusters.
R offers powerful tools to perform and visualize clustering for both numerical and categorical
data.
1. Types of Clustering Methods

Method       | Description                                 | Suitable For
K-Means      | Partitions data into K clusters             | Numerical features
Hierarchical | Creates a tree of clusters (dendrogram)     | Numerical features
DBSCAN       | Density-based clustering                    | Arbitrary shapes/no.
Model-Based  | Assumes data comes from a mixture of models | Probabilistic methods


2. K-Means Clustering
a. Load Data and Apply Clustering
data(iris)
set.seed(123)
kmodel <- kmeans(iris[, 1:4], centers = 3)
print(kmodel)
b. Compare with True Species Labels
table(kmodel$cluster, iris$Species)
c. Visualize Clusters
library(ggplot2)
iris$Cluster <- as.factor(kmodel$cluster)
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Cluster)) +
  geom_point(size = 3) +
  labs(title = "K-Means Clustering on Iris Data")

3. Hierarchical Clustering
a. Compute Distance Matrix and Apply Clustering
d <- dist(iris[, 1:4]) # Euclidean distance
hc <- hclust(d, method = "complete")

plot(hc, labels = iris$Species, main = "Hierarchical Clustering Dendrogram")


b. Cut into Clusters
cutree(hc, k = 3)
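cutree() returns one integer label per observation, so the result can be cross-tabulated against the known species in the same way as the K-Means clusters. A minimal sketch, with clusters as an illustrative name:

```r
data(iris)
d  <- dist(iris[, 1:4])               # Euclidean distance on the four measurements
hc <- hclust(d, method = "complete")

clusters <- cutree(hc, k = 3)         # one integer label (1, 2, or 3) per flower

# Cross-tabulate cluster labels against the known species
table(Cluster = clusters, Species = iris$Species)
```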

4. DBSCAN – Density-Based Clustering


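library(dbscan)
# Using scaled Iris data
data <- scale(iris[, 1:4])
db <- dbscan(data, eps = 0.5, minPts = 5)
# Scatterplot matrix coloured by cluster; noise points (cluster 0) are drawn in black
pairs(data, col = db$cluster + 1L)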
• eps: neighborhood radius
• minPts: minimum points to form a cluster
5. Clustering Evaluation
Silhouette Score (Higher = Better)
library(cluster)
sil <- silhouette(kmodel$cluster, dist(iris[, 1:4]))
plot(sil)
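The silhouette plot can be reduced to a single number, the mean silhouette width, which makes it easy to compare several clusterings. The sketch below is one possible way to do that for K-Means on the iris measurements; the range k = 2 to 5, the seed, and nstart = 10 are arbitrary choices for illustration.

```r
library(cluster)

data(iris)
features <- iris[, 1:4]
d <- dist(features)

# Mean silhouette width for a few candidate cluster counts
avg_width <- sapply(2:5, function(k) {
  set.seed(123)  # same seed for every k so runs are reproducible
  km <- kmeans(features, centers = k, nstart = 10)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_width) <- paste0("k=", 2:5)
avg_width  # higher values indicate better-separated clusters
```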

✅ Use Cases of Clustering


• Customer segmentation (marketing)
• Document grouping (text mining)
• Image compression (pixel clustering)
• Anomaly detection (outliers end up in very small clusters or, with DBSCAN, are labelled as noise)
