UNIT – V STUDY MATERIAL
Interfacing R with Other Languages
Interfacing R with other programming languages allows you to combine R's statistical
capabilities with the performance or specialized libraries of languages like Python, C++, or
Java. Below are the most common and useful interfaces:
1. Interfacing R with Python using reticulate
Access powerful Python libraries (e.g., TensorFlow, pandas, NumPy, scikit-learn).
Seamlessly integrate R and Python workflows.
Installation:
install.packages("reticulate")
Basic Example:
library(reticulate)
# Run Python code from R
py_run_string("import numpy as np; x = np.array([1, 2, 3]); x = x * 2")
py$x # Access the Python variable x in R
Calling Python Functions:
np <- import("numpy")
np$mean(c(1, 2, 3)) # Calls numpy’s mean
Running a Python script:
py_run_file("script.py")
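Converting Objects Between R and Python:
A minimal sketch, assuming the pandas Python package is installed (the object names here are only illustrative):
library(reticulate)
pd <- import("pandas", convert = FALSE) # keep results as Python objects
# Build a pandas DataFrame from an R list, then convert it back to an R data.frame
df_py <- pd$DataFrame(data = list(a = c(1, 2, 3), b = c("x", "y", "z")))
py_to_r(df_py)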
2. Interfacing R with C/C++ using Rcpp
Significant performance improvements for computationally heavy tasks.
Use C++ functions directly in R.
Installation:
install.packages("Rcpp")
Basic Example:
library(Rcpp)
cppFunction('
int factorial(int n) {
  if (n <= 1) return 1;
  else return n * factorial(n - 1);
}
')
factorial(5) # Output: 120
Using External C++ Files:
sourceCpp("your_cpp_code.cpp")
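A sketch of what such a file might contain, written out from R for convenience (the file name example.cpp and the function timesTwo are only illustrative); the // [[Rcpp::export]] attribute is what makes the C++ function callable from R:
library(Rcpp)
# Write a small illustrative C++ source file, then compile it with sourceCpp()
writeLines(c(
  "#include <Rcpp.h>",
  "using namespace Rcpp;",
  "",
  "// [[Rcpp::export]]",
  "NumericVector timesTwo(NumericVector x) {",
  "  return x * 2;",
  "}"
), "example.cpp")
sourceCpp("example.cpp")
timesTwo(c(1, 2, 3)) # 2 4 6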
3. Interfacing R with Java using rJava
Use Java libraries directly in R.
Important for working with enterprise software systems and Java-based APIs.
Installation:
install.packages("rJava")
Basic Example:
library(rJava)
.jinit() # Start the Java Virtual Machine
a <- .jnew("java/lang/String", "Hello from Java")
.jcall(a, "I", "length") # Calls the length method
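As a further sketch using the same String object a, other methods can be called in the same way; rJava returns Java strings as R character values:
.jstrVal(a) # Retrieve the Java String back as an R character value
.jcall(a, "Ljava/lang/String;", "toUpperCase") # "HELLO FROM JAVA"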
Use Cases:
Python: Deep learning, image processing, web scraping.
C/C++: High-performance simulations, numerical methods.
Java: Working with JVM libraries, interacting with large-scale enterprise apps.
Parallel R
R is traditionally single-threaded, but for large computations or simulations, parallel
computing can drastically improve performance. R offers several packages to support parallel
execution on multiple cores or processors.
1. Why Use Parallel R?
Faster computation by distributing tasks across CPU cores.
Efficient for loops, simulations, bootstrapping, and large data processing.
2. Core Packages for Parallelism
a. parallel Package (Built-in)
R’s base package, available by default.
Example: Parallel Apply (parLapply)
library(parallel)
# Detect available cores
cores <- detectCores()
cl <- makeCluster(2) # Use 2 cores
# Parallel version of lapply
result <- parLapply(cl, 1:5, function(x) x^2)
stopCluster(cl)
print(result)
Other Functions:
parSapply() – parallel version of sapply
mclapply() – fork-based parallelism (Unix/macOS only)
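Minimal sketches of both (mclapply() relies on forking, so it runs in parallel only on Unix/macOS):
library(parallel)
# parSapply: like parLapply but simplifies the result to a vector
cl <- makeCluster(2)
parSapply(cl, 1:5, function(x) x^2)
stopCluster(cl)
# mclapply: fork-based, no cluster object needed (Unix/macOS only)
mclapply(1:5, function(x) x^2, mc.cores = 2)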
b. foreach + doParallel: Elegant Parallel Loops
More readable and scalable than parLapply.
Example: Using foreach with %dopar%
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)
result <- foreach(i = 1:5) %dopar% {
  i^2
}
stopCluster(cl)
print(result)
Benefits:
More readable syntax.
Supports nested loops and conditional logic inside the loop body.
c. future and furrr: Parallel and Functional
Modern approach using futures for parallelism.
library(future)
library(furrr)
plan(multisession, workers = 2) # Run tasks in 2 background R sessions
future_map(1:5, ~ .x^2)
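When the parallel work is finished, the plan can be switched back:
plan(sequential) # return to ordinary sequential execution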
Tips for Effective Parallel R
Avoid using large global variables inside parallel code.
Always stop the cluster (stopCluster) to free resources.
Use detectCores() to dynamically scale across different machines.
Use clusterExport() to share variables with worker processes (see the sketch below).
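A minimal sketch of clusterExport() (the variable name scale_factor is only illustrative):
library(parallel)
cl <- makeCluster(2)
scale_factor <- 10 # defined in the main R session
clusterExport(cl, "scale_factor") # copy it to every worker
parLapply(cl, 1:5, function(x) x * scale_factor)
stopCluster(cl)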
Basic Statistics in R
R is designed for statistical computing, and it provides extensive built-in functions for
descriptive and inferential statistics.
1. Descriptive Statistics
These summarize and describe features of a dataset.
Measures of Central Tendency
data <- c(2, 4, 6, 8, 10)
mean(data) # Arithmetic mean
median(data) # Middle value
Measures of Dispersion
var(data) # Variance
sd(data) # Standard deviation
range(data) # Minimum and maximum
Five-Number Summary (plus the Mean)
summary(data) # Min, 1st Qu., Median, Mean, 3rd Qu., Max
Frequency Tables
table(c("A", "B", "A", "C", "B", "A"))
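Relative frequencies can be obtained from the same table:
prop.table(table(c("A", "B", "A", "C", "B", "A"))) # proportions instead of counts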
2. Graphical Summary
Histograms and Boxplots
hist(data, main = "Histogram", col = "lightblue")
boxplot(data, main = "Boxplot")
Density Plot
plot(density(data), main = "Density Plot")
3. Inferential Statistics
These help draw conclusions about a population from sample data.
One-Sample t-Test
Test if the mean of a sample differs from a given value.
t.test(data, mu = 5)
Two-Sample t-Test
group1 <- c(5, 6, 7, 8)
group2 <- c(7, 8, 9, 10)
t.test(group1, group2)
Paired t-Test
before <- c(10, 12, 14, 16)
after <- c(11, 14, 13, 17)
t.test(before, after, paired = TRUE)
4. Correlation and Covariance
Pearson Correlation
x <- c(1, 2, 3, 4)
y <- c(2, 4, 6, 8)
cor(x, y) # Correlation coefficient
cov(x, y) # Covariance
5. Chi-Square Test
Used for categorical data to test independence.
# Create a contingency table
obs <- matrix(c(20, 30, 50, 100), nrow = 2)
chisq.test(obs)
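For readability, the rows and columns can be labelled before testing (the labels below are purely illustrative):
dimnames(obs) <- list(Group = c("A", "B"), Outcome = c("Yes", "No"))
chisq.test(obs)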
6. ANOVA (Analysis of Variance)
To compare means across multiple groups.
data(iris)
anova_result <- aov(Sepal.Length ~ Species, data = iris)
summary(anova_result)
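If the ANOVA is significant, a common follow-up is a pairwise post-hoc comparison:
TukeyHSD(anova_result) # which pairs of species differ in mean Sepal.Length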
Linear Models (LM) & Generalized Linear Models (GLM) in R
1. Linear Models (LM)
Linear models describe the relationship between a continuous response variable and one or
more predictor variables using a linear equation.
Model Form:
Y = β₀ + β₁X₁ + β₂X₂ + ⋯ + ε
Basic Linear Model in R
data(mtcars)
model <- lm(mpg ~ wt + hp, data = mtcars)
summary(model)
Interpretation:
Estimate: Coefficients (slopes)
Pr(>|t|): p-values for testing if coefficients are significantly different from zero
R-squared: Goodness of fit
Residuals: Differences between actual and predicted values
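These quantities can also be extracted directly from the fitted model:
coef(model) # estimated coefficients
summary(model)$coefficients # estimates, standard errors, t values, p-values
summary(model)$r.squared # R-squared
head(residuals(model)) # first few residuals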
Diagnostic Plots
par(mfrow = c(2, 2))
plot(model)
These help check assumptions:
Linearity
Normality of residuals
Homoscedasticity (equal variance)
Independence
Model with Interaction Term
lm(mpg ~ wt * hp, data = mtcars)
2. Generalized Linear Models (GLM)
GLMs extend LMs to handle non-normal response distributions using a link function.
Model Structure:
g(E(Y)) = β₀ + β₁X₁ + …
Where g() is the link function, and E(Y) is the expected value.
GLM Syntax in R
glm(formula, family = <distribution>, data = ...)
Common Families and Link Functions:
Family   | Link Function | Use Case
gaussian | identity      | Linear regression
binomial | logit         | Logistic regression (binary)
poisson  | log           | Count data
a. Logistic Regression (Binomial Family)
glm_model <- glm(vs ~ mpg + wt, data = mtcars, family = binomial)
summary(glm_model)
Used when the response variable is binary (0/1).
vs: engine shape (0 = V-shaped, 1 = straight).
b. Poisson Regression
# Example: modeling counts (hypothetical)
counts <- rpois(100, lambda = 5)
group <- gl(2, 50)
glm(counts ~ group, family = poisson)
Model Evaluation
For Logistic Regression
# Predicted probabilities
predicted_probs <- predict(glm_model, type = "response")
# Confusion matrix
predicted_class <- ifelse(predicted_probs > 0.5, 1, 0)
table(predicted_class, mtcars$vs)
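A simple summary of the confusion matrix is the overall classification accuracy:
mean(predicted_class == mtcars$vs) # proportion of correctly classified cars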
When to Use LM vs. GLM?
Situation                       | Use LM | Use GLM
Response is continuous          | ✅     | ✅ (gaussian)
Response is binary              | ❌     | ✅ (binomial/logit)
Response is count data          | ❌     | ✅ (poisson/log)
Non-constant variance or skewed | ❌     | ✅ (use link function)
Non-linear Models in R
Non-linear models are used when the relationship between variables can't be explained well by
a straight line.
Model Form:
y = f(x, θ) + ε
Using nls() for Non-Linear Least Squares
# Simulated exponential growth data
x <- 1:10
y <- 5 * exp(0.4 * x) + rnorm(10, sd = 2)
# Fit non-linear model: y = a * exp(b * x)
model <- nls(y ~ a * exp(b * x), start = list(a = 1, b = 0.1))
summary(model)
Visualizing the Fit
plot(x, y, main = "Non-linear Fit")
lines(x, predict(model), col = "red", lwd = 2)
Time Series and Auto-Correlation in R
Time series analysis involves data points ordered in time, often with the goal of forecasting
future values, identifying trends, or detecting patterns like seasonality and cyclicity.
1. What is a Time Series?
A time series is a sequence of observations taken at successive equally spaced points in time.
Examples: stock prices, temperature data, sales over months.
2. Creating Time Series in R
Use ts() to create a time series object.
data <- c(112, 118, 132, 129, 121, 135, 148, 148, 136, 119)
ts_data <- ts(data, start = c(2020, 1), frequency = 12)
plot(ts_data, main = "Monthly Time Series", ylab = "Value")
start: starting year and period
frequency: 12 (monthly), 4 (quarterly), 1 (yearly)
3. Time Series Components
A time series typically has 4 components:
Trend: Long-term progression
Seasonality: Regular periodic fluctuations
Cyclic: Irregular, long-term patterns
Residual: Random noise
Decomposition Example
decompose() needs at least two full seasonal periods, so a longer monthly series such as the built-in AirPassengers data is used here:
decomposed <- decompose(AirPassengers)
plot(decomposed)
4. Auto-Correlation
Auto-correlation measures how the current values of a series are related to its past values (lags). It is a key concept for building models like ARIMA.
Auto-Correlation Function (ACF)
acf(ts_data, main = "ACF Plot")
Partial Auto-Correlation Function (PACF)
pacf(ts_data, main = "PACF Plot")
5. ARIMA Model (AutoRegressive Integrated Moving Average)
A widely used model for univariate forecasting.
Automatic ARIMA
library(forecast)
fit <- auto.arima(ts_data)
summary(fit)
forecasted <- forecast(fit, h = 12)
plot(forecasted)
auto.arima() selects the best (p,d,q) model
h = 12 forecasts 12 future time points
6. Stationarity Check
Time series must be stationary for ARIMA.
Augmented Dickey-Fuller Test (ADF Test)
library(tseries)
adf.test(ts_data)
p < 0.05: reject the null hypothesis of a unit root, i.e. the series can be treated as stationary
If not, you may need to difference the series:
diffed_data <- diff(ts_data)
plot(diffed_data)
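After differencing, the ADF test can be run again to check whether the differenced series is stationary:
adf.test(diffed_data)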
7. Seasonal Decomposition Using Loess (STL)
More robust to irregular seasonality. Like decompose(), stl() needs at least two full seasonal periods, so the built-in AirPassengers series is used here as well:
fit_stl <- stl(AirPassengers, s.window = "periodic")
plot(fit_stl)
✅ Applications
Sales forecasting
Temperature prediction
Website traffic analysis
Financial time series
Clustering in R – Detailed Explanation with Examples
Clustering is an unsupervised learning technique that groups similar data points into clusters.
R offers powerful tools to perform and visualize clustering for both numerical and categorical
data.
1. Types of Clustering Methods
Method       | Description                                  | Suitable For
K-Means      | Partitions data into K clusters              | Numerical features
Hierarchical | Creates a tree of clusters (dendrogram)      | Numerical features
DBSCAN       | Density-based clustering                     | Arbitrary cluster shapes, noisy data
Model-Based  | Assumes data comes from a mixture of models  | Probabilistic methods
2. K-Means Clustering
a. Load Data and Apply Clustering
data(iris)
set.seed(123)
kmodel <- kmeans(iris[, 1:4], centers = 3)
print(kmodel)
b. Compare with True Species Labels
table(kmodel$cluster, iris$Species)
c. Visualize Clusters
library(ggplot2)
iris$Cluster <- as.factor(kmodel$cluster)
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Cluster)) +
  geom_point(size = 3) +
  labs(title = "K-Means Clustering on Iris Data")
3. Hierarchical Clustering
a. Compute Distance Matrix and Apply Clustering
d <- dist(iris[, 1:4]) # Euclidean distance
hc <- hclust(d, method = "complete")
plot(hc, labels = iris$Species, main = "Hierarchical Clustering Dendrogram")
b. Cut into Clusters
cutree(hc, k = 3)
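As with k-means, the resulting cluster assignments can be compared with the true species labels:
clusters_hc <- cutree(hc, k = 3)
table(clusters_hc, iris$Species)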
4. DBSCAN – Density-Based Clustering
library(dbscan)
# Using scaled Iris data
data <- scale(iris[, 1:4])
db <- dbscan(data, eps = 0.5, minPts = 5)
plot(data[, 1:2], col = db$cluster + 1L) # colour the scaled points by cluster; noise (cluster 0) appears in black
eps: neighborhood radius
minPts: minimum points to form a cluster
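A common heuristic for choosing eps is the k-nearest-neighbour distance plot from the dbscan package; eps is placed near the "elbow" of the curve (the 0.5 below simply matches the example above):
kNNdistplot(data, k = 5) # sorted distances to each point's 5th nearest neighbour
abline(h = 0.5, lty = 2) # candidate eps near the elbow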
5. Clustering Evaluation
Silhouette Score (Higher = Better)
library(cluster)
sil <- silhouette(kmodel$cluster, dist(iris[, 1:4]))
plot(sil)
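The average silhouette width gives a single summary value (closer to 1 means better-separated clusters):
mean(sil[, "sil_width"])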
✅ Use Cases of Clustering
Customer segmentation (marketing)
Document grouping (text mining)
Image compression (pixel clustering)
Anomaly detection (outliers form their own cluster)