Classification Models
Logistic Regression:
Explanation:
Used for binary classification problems (i.e., the response variable is binary, 0 or 1).
It models the probability that an instance belongs to a particular category.
In this case, we are predicting whether a car's miles per gallon (mpg) is above or below the mean value.
The logistic function (sigmoid) is used to map predictions to probabilities.
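For intuition, a minimal sketch of the sigmoid in base R (the helper name sigmoid is illustrative only, not from any package):
sigmoid <- function(z) 1 / (1 + exp(-z))
sigmoid(c(-3, 0, 3))  # roughly 0.05, 0.50, 0.95: large negative scores map near 0, large positive near 1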
When to Use:
When the relationship between the predictor variables and the log-odds of the response is
approximately linear.
Logistic Regression is chosen when the response variable is categorical, and in this example, it's whether
the mpg is above or below the mean.
Suitable for problems where the outcome is binary, like whether an email is spam or not.
Predictors:
Predictor variables should be numeric or categorical.
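If some predictors are categorical, they are typically expanded into 0/1 indicator columns before modelling. A minimal sketch with caret::dummyVars, treating mtcars$am as a factor purely for illustration (mtcars is all-numeric by default):
library(caret)
df <- mtcars
df$am <- factor(df$am, labels = c("automatic", "manual"))   # pretend 'am' is categorical
dummies <- dummyVars(~ ., data = df)                        # builds the indicator-column encoding
df_encoded <- as.data.frame(predict(dummies, newdata = df))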
# =======================================
# R code: Logistic Regression
# =======================================
Step 1: Load Libraries
library(caret)
library(dplyr)
library(zoo) # used in finding and replacing NA values with mean
Step 2: Load Dataset
data <- mtcars
Step 3: Handle Missing Values, Scaling, and Normalization
# Check for missing values
summary(data)
# If there are missing values:
# Options: 1) drop rows with na.omit() (loses data), or 2) impute with the mean or median (preferred)
# Option A: replace NAs with the column mean
# data[] <- lapply(data, function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))
# (zoo::na.aggregate offers similar mean-based replacement)
# Option B: replace NAs with the column median via caret
preprocess_params <- preProcess(data, method = "medianImpute")
# Apply the pre-processing to replace missing values
data <- predict(preprocess_params, newdata = data)
# If scaling or normalization is needed, you can use:
# data <- as.data.frame(scale(data))                          # centre and scale (standardize)
# data <- predict(preProcess(data, method = "range"), data)   # rescale to the [0, 1] range
Step 4: Data Splitting
# Set seed for reproducibility
set.seed(123)
# Split the data into training (80%) and testing (20%) sets
train_index <- createDataPartition(data$mpg, p = 0.8, list = FALSE)
train_data <- data[train_index, ]
test_data <- data[-train_index, ]
Step 5: Build Logistic Regression Model
# Create the binary response: is mpg above the overall mean?
train_data$mpg_high <- factor(train_data$mpg > mean(data$mpg), levels = c(FALSE, TRUE))
test_data$mpg_high  <- factor(test_data$mpg > mean(data$mpg), levels = c(FALSE, TRUE))
# Exclude the raw mpg column from the predictors
# (with so few rows, glm may warn about fitted probabilities of 0 or 1)
log_model <- glm(mpg_high ~ . - mpg, data = train_data, family = "binomial")
Step 6: Model Summary or Plots
# Summary statistics
summary(log_model)
# Or you can create plots if applicable
Step 7: Make Predictions
predictions <- predict(log_model, newdata = test_data, type = "response")
Step 8: Model Evaluation Metrics
# Evaluate model accuracy and performance
conf_matrix <- confusionMatrix(factor(predictions > 0.5, levels = c(FALSE, TRUE)),
                               test_data$mpg_high)
# Display the confusion matrix and other metrics
conf_matrix
=======================
Discriminant Analysis:
Explanation:
Discriminant Analysis is used when there are two or more classes and the goal is to find the linear
combination of features that best separates them.
Assumes normal distribution of predictor variables within each class.
When to Use:
When you have two or more classes and you want to classify new observations into one of them.
Predictors:
Assumes continuous predictors that are normally distributed.
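A minimal sketch of linear discriminant analysis using MASS::lda on the iris data (chosen here only because it has three classes; it is not part of the notes above):
library(MASS)    # lda()
library(caret)   # createDataPartition(), confusionMatrix()
set.seed(123)
idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train <- iris[idx, ]
test  <- iris[-idx, ]
lda_model <- lda(Species ~ ., data = train)           # linear combinations that best separate the classes
lda_pred  <- predict(lda_model, newdata = test)$class
confusionMatrix(lda_pred, test$Species)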
Naive Bayes Classifier:
Explanation:
Naive Bayes is a probabilistic algorithm based on Bayes' theorem, assuming independence between
predictors.
Despite its "naive" assumption, it performs surprisingly well in many real-world situations.
When to Use:
Particularly effective for text classification (spam detection, sentiment analysis).
Predictors:
Works well with both categorical and continuous predictors.
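A minimal sketch with e1071::naiveBayes (the e1071 package is an assumption; it is not loaded in the walkthrough above). Continuous predictors are modelled with class-conditional Gaussians by default:
library(e1071)   # naiveBayes()
set.seed(123)
idx <- sample(nrow(iris), 0.8 * nrow(iris))
train <- iris[idx, ]
test  <- iris[-idx, ]
nb_model <- naiveBayes(Species ~ ., data = train)
nb_pred  <- predict(nb_model, newdata = test)
table(predicted = nb_pred, actual = test$Species)     # simple confusion table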
Support Vector Machines (SVM):
Explanation:
SVM is a powerful classification algorithm that finds the hyperplane that best separates data points of
different classes.
It works well in high-dimensional spaces and is effective in cases where the number of dimensions is
greater than the number of samples.
When to Use:
Useful for both linear and non-linear data.
Effective when there is a clear margin of separation between classes.
Predictors:
Works with numeric predictors; it's essential to scale the data for SVM.
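A minimal sketch with e1071::svm (an assumption, not from the notes above). Note that svm() scales numeric predictors by default (scale = TRUE), which matches the advice above to scale the data:
library(e1071)   # svm()
set.seed(123)
idx <- sample(nrow(iris), 0.8 * nrow(iris))
train <- iris[idx, ]
test  <- iris[-idx, ]
svm_model <- svm(Species ~ ., data = train, kernel = "radial", scale = TRUE)
svm_pred  <- predict(svm_model, newdata = test)
table(predicted = svm_pred, actual = test$Species)
# 2-D slice of the decision boundary (remaining predictors held at fixed values)
plot(svm_model, train, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 3, Sepal.Length = 5))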
Plots:
Logistic Regression and Discriminant Analysis:
Commonly used plots include ROC curves, confusion matrices, and decision boundaries (a minimal ROC sketch is given after this list).
SVM:
SVM often involves visualizing decision boundaries in feature space.
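A minimal ROC sketch for the logistic regression fit above, assuming the predictions vector and test_data (with its mpg_high column) from Steps 5-7, and the pROC package, which is not loaded earlier:
library(pROC)
roc_obj <- roc(response = test_data$mpg_high, predictor = predictions)
plot(roc_obj)    # ROC curve
auc(roc_obj)     # area under the curve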