
UNIT V

Generalization error is the error a model makes on new, unseen data. It is often measured as the gap between training and testing error, and it becomes apparent when a model performs well on the training data but poorly on data it has not seen before.

Types of Generalization Errors:

1. Overfitting: Models are too complex and fit the noise in the training data.
2. Underfitting: Models are too simple and fail to capture important patterns.
Causes of Generalization Errors:
1. Insufficient training data
2. Model complexity
3. Noise in training data
4. Over-optimization
5. Poor feature selection

Overfitting occurs when a model is too complex and learns the noise in the training data, resulting in
poor performance on unseen data.
Overfitting Leads to:
1. High training accuracy
2. Low test accuracy
3. Model complexity (e.g., many features, deep neural networks)
Overfitting Prevention Techniques:
1. Regularization (L1, L2, dropout); a brief sketch follows this list
2. Cross-validation
3. Early stopping
4. Feature selection
5. Data augmentation
6. Ensemble methods (bagging, boosting)
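
As a quick illustration of the first technique, the following minimal sketch fits an L2-regularized logistic regression in scikit-learn. The synthetic dataset and the regularization strength C=0.1 are illustrative assumptions, not values taken from these notes.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative synthetic dataset (an assumption for this sketch)
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# L2 regularization penalizes large weights; smaller C means a stronger penalty
clf = LogisticRegression(penalty='l2', C=0.1, max_iter=1000)
clf.fit(X_train, y_train)
print('Training accuracy:', clf.score(X_train, y_train))
print('Test accuracy:', clf.score(X_test, y_test))

Comparing the two accuracies gives a first indication of whether the model is overfitting.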

Examples:

1. Image Classification: A neural network with 1000 layers and 1000 features is trained on a dataset of
100 images. The model achieves 99% accuracy on the training data but only 50% accuracy on the test
data.

2. Stock Market Prediction: A linear regression model with 100 features is trained on a dataset of 1000
stock prices. The model achieves a high R-squared value on the training data but performs poorly on
unseen data.

Cross-validation is a statistical technique used to evaluate the performance of a machine learning model
by training and testing it on multiple subsets of the data. An alternative to predefined separate training
and testing data to validate generalizability is the cross-validation technique, sometimes called rotation
estimation.

There are different variations of cross-validation used in practice:


1. Holdout method

The simplest kind of cross-validation is the holdout method. The sample is separated into two disjoint sets, called the training set and the testing set. A model is built using the training set only, and this model is then asked to predict the output values for the test set. The errors in its predictions are accumulated into a single figure, often called the mean absolute test set error, which serves as an evaluation measure of the model.
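
A minimal sketch of the holdout method with scikit-learn follows; the diabetes dataset, the linear model, and the 70/30 split ratio are illustrative assumptions.

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# Separate the sample into two disjoint sets: training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LinearRegression().fit(X_train, y_train)           # build the model on the training set only
mae = mean_absolute_error(y_test, model.predict(X_test))   # mean absolute test set error
print('Mean absolute test set error:', mae)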

2. K-Fold Cross-Validation

In this case, the dataset is divided into k subsets and the evaluation is done k times. Each time, one of the k subsets is kept aside for testing and the remaining k − 1 subsets are put together to form the training set. It can therefore be seen as the holdout method repeated k times. The average error across all k trials is computed and used as the overall performance estimate of the model.

The advantage of k-fold cross-validation is that the model is less biased from how the data gets divided
between training and test sets. Every data point gets a chance to be in a test set exactly once and in the
training set k − 1 times. Therefore, the variance of the resulting estimate is reduced as k is increased.

The disadvantage of this over the holdout method is that for k-fold cross-validation the training
algorithm has to be run k times compared to just once in the holdout method, and it takes k times as
much computation to make an evaluation.
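
A minimal k-fold sketch using scikit-learn's KFold and cross_val_score; the iris dataset, the logistic regression model, and k = 5 are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Each of the k = 5 folds is used once as the test set; the other k - 1 folds form the training set
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kfold)
print('Accuracy per fold:', scores)
print('Average accuracy across all k trials:', scores.mean())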

3. Leave-One-Out Cross-Validation (LOOCV)

Leave-one-out cross-validation is the logical extreme of k-fold cross-validation, where k is equal to the number of data points (N) in the set. The function approximator is trained on all the data except for one point, which is kept aside for testing, and the process is repeated so that each data point is used exactly once as the test case, for a total of N runs.

The average error over the N evaluations is computed and used as the overall error of the model. The evaluation provided by the leave-one-out cross-validation method (LOO-XVE) is good: it has low bias, since nearly all the data is used for training in each iteration, and it is well suited to small datasets where the use of the available data for training must be maximized.

The disadvantage is that it is very expensive to compute for large datasets, since the model must be trained N times. It also gives a high-variance evaluation, since each fold is tested on only one data point, leading to potentially high variability in the results.
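
A minimal LOOCV sketch with scikit-learn's LeaveOneOut; the iris dataset and the logistic regression model are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

# One data point is held out per iteration, so the model is trained N times
loo = LeaveOneOut()
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=loo)
print('Number of fits (N):', len(scores))
print('Average accuracy over the N evaluations:', scores.mean())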

4. Leave-P-Out Cross-Validation (LPOCV)

The model is trained on the entire dataset except for p samples and tested on those p samples, and this is repeated for all possible combinations. The number of iterations is the number of ways p points can be chosen from a dataset of size n, i.e. C(n, p), which becomes very large even for moderately sized datasets and values of p.

The advantage is that it provides a more comprehensive evaluation of the model's performance by testing on all possible subsets of p points. The disadvantage is that it is computationally more expensive than LOOCV, especially for large p or large n, because the number of combinations C(n, p) grows combinatorially.
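
A minimal LPOCV sketch with scikit-learn's LeavePOut; the tiny synthetic dataset and p = 2 are illustrative assumptions chosen to keep the number of C(n, p) splits manageable.

from math import comb

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeavePOut, cross_val_score

# A tiny illustrative dataset: with n samples and p held out there are C(n, p) train/test splits
X, y = make_classification(n_samples=20, n_features=5, random_state=42)

p = 2
lpo = LeavePOut(p=p)
print('Number of train/test splits:', comb(len(X), p))   # C(20, 2) = 190

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=lpo)
print('Average accuracy over all splits:', scores.mean())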
5. Stratified K-Fold Cross-Validation

In stratified K-fold cross-validation, the dataset is divided into K folds, and the folds are made
by preserving the percentage of samples for each class. For each of the K rounds, a different fold is used
as a validation set (or test set), while the remaining K-1 folds form the training set. The model is trained
on the training set and validated on the validation set. This process is repeated K times, with each of the
K folds used exactly once as the validation data.

After training, the model's performance is evaluated in each iteration, and the results are averaged out to
get a final score. This score is more robust than a simple train-test split, as it ensures that every observation
from the original dataset has the chance to appear in both the training and validation sets, mitigating any
potential bias in the model evaluation due to particular data splits.

Advantages: Stratified K-fold cross-validation is advantageous because it reduces the variance associated
with a single trial of train-test split. By preserving the distribution of classes in each fold, it ensures that
each training and validation set is representative of the overall dataset, which is especially important in
imbalanced datasets where some classes dominate over others.
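
A minimal stratified K-fold sketch with scikit-learn's StratifiedKFold; the imbalanced synthetic dataset (roughly a 90/10 class split), the logistic regression model, and the F1 scoring choice are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Illustrative imbalanced dataset: about 90% of samples belong to one class
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Each fold preserves the roughly 90/10 class proportions of the full dataset
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=skf, scoring='f1')
print('F1 per fold:', scores)
print('Average F1:', scores.mean())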

Underfitting

Definition: Underfitting occurs when a model is too simple and fails to capture the underlying patterns in
the training data.

Symptoms:

1. Low training accuracy

2. Low test accuracy

3. Simple model (e.g., few features, shallow neural networks)

Causes:

1. Too few features

2. Insufficient model complexity

3. Inadequate training data

Consequences:

1. Poor predictive performance

2. Failure to capture important patterns

3. Inadequate insights from the model


Examples:

1. Customer Churn Prediction: A logistic regression model with only 2 features is trained on a dataset of
1000 customer records. The model achieves a low accuracy on both training and test data.

2. Speech Recognition: A shallow neural network with only 1 hidden layer is trained on a dataset of 1000
speech recordings. The model fails to recognize speech patterns accurately.

Prevention and Remedies:

Overfitting:

1. Regularization (L1, L2, dropout)

2. Cross-validation

3. Early stopping

4. Feature selection

5. Data augmentation

6. Ensemble methods (bagging, boosting)

Underfitting:

1. Increase model complexity (see the sketch after this list)

2. Add relevant features

3. Increase training data

4. Hyperparameter tuning

5. Feature engineering
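
As a quick illustration of the first two remedies, the minimal sketch below adds polynomial features so that a linear model can capture a non-linear pattern it would otherwise underfit. The synthetic quadratic data and the degree-2 expansion are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Illustrative non-linear (quadratic) data that a plain line underfits
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=200)

underfit = LinearRegression().fit(X, y)   # too simple for the quadratic pattern
richer = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print('Plain linear R^2:', underfit.score(X, y))
print('With polynomial features R^2:', richer.score(X, y))

The jump in R^2 after adding the squared feature shows how extra complexity or richer features can remove underfitting.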
Hyperparameter Tuning using Grid Search
Grid Search:

Grid Search is a method for finding the optimal combination of hyperparameters for a machine learning model. Implementing a grid search is a reliable way to identify the best hyperparameters for an algorithm. In the case of complex settings with multiple parameters, this means running hundreds, if not thousands, of slightly differently tuned models.

Grid searching is a systematic search method that combines all the possible combinations of the
hyperparameters into individual sets. It’s a time-consuming technique. However, grid searching provides
one of the best ways to optimize a machine learning application that could have many working
combinations, but just a single best one.

Key characteristics:

1. Exhaustive search

2. Pre-defined hyperparameter ranges

3. Evaluates all possible combinations

4. Computationally expensive

Hyperparameter Tuning:

Hyperparameter Tuning is the process of adjusting hyperparameters to optimize a model's performance.

Key characteristics:

1. Automated or manual adjustment

2. Goal-oriented (e.g., maximize accuracy)

3. May use various optimization techniques (e.g., Bayesian, gradient-based)

4. Can be computationally efficient or expensive

Grid Search is a systematic approach for hyperparameter optimization in machine learning. It is essential
for finding the best combination of hyperparameters that yields the most effective model performance.

Importance of Grid Search:

1. Systematic Search for Optimal Parameters: Grid Search allows for an exhaustive search over a
specified set of hyperparameters. By evaluating every possible combination within the defined
parameter grid, it ensures that the best parameter set is identified to maximize model
performance.

2. Enhancing Model Accuracy: The performance of a machine learning model heavily depends on
hyperparameters such as learning rate, regularization strength, number of layers, and kernel
type (in SVMs). Proper tuning can significantly improve the model’s accuracy, precision, recall,
and overall robustness.

3. Automated and Reproducible: Grid Search automates the process of hyperparameter selection,
making it more consistent and reproducible compared to manual tuning. This leads to a more
reliable model-building process, especially when scaling up to more complex models.

4. Cross-Validation for Reliable Results: Grid Search often incorporates cross-validation (e.g., k-fold
cross-validation), which ensures that the hyperparameter evaluation is robust against overfitting.
Each combination of hyperparameters is validated across multiple data splits, providing a more
accurate estimate of performance.

Steps to perform Grid Search:

1. Define the Hyperparameter Grid: Specify the hyperparameters and their potential values to
explore. For example:

param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': [1, 0.1, 0.01]
}

 'C': This is a hyperparameter for Support Vector Machines (SVM). It controls the regularization
strength. Smaller values of C create a wider margin but may allow more misclassifications,
making the model more generalized. Larger values of C create a narrower margin, making the
model try to classify all training examples correctly, which can lead to overfitting.
 'kernel': This parameter specifies the type of kernel function to be used in the algorithm. The
kernel function determines how the input space is transformed.

o 'linear' uses a linear kernel, which is suitable for linearly separable data.

o 'rbf' (Radial Basis Function) is a non-linear kernel that is powerful for capturing complex
relationships.

 'gamma': This is another hyperparameter for SVM, specifically for non-linear kernels like 'rbf'. It
defines the influence of a single training example:

o Higher values of gamma mean the model will consider only points close to the decision boundary, making it more complex and potentially overfitting.

o Lower values mean that the model will consider points farther away, resulting in a simpler and smoother decision boundary.

2. Evaluate Each Combination: Train the model using each combination of hyperparameters and
evaluate its performance using a scoring metric (e.g., accuracy, F1-score).
3. Select the Best Combination: Identify the combination that results in the highest performance metric and use it to train the final model. A complete sketch putting these three steps together is shown below.
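
A minimal end-to-end sketch of these steps with scikit-learn's GridSearchCV and an SVC, using the param_grid from step 1; the iris dataset, the accuracy metric, and the 5-fold setting are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Step 1: the hyperparameter grid (3 * 2 * 3 = 18 combinations)
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': [1, 0.1, 0.01]
}

# Step 2: train and score every combination with 5-fold cross-validation
grid_search = GridSearchCV(SVC(), param_grid, scoring='accuracy', cv=5)
grid_search.fit(X, y)

# Step 3: the best combination and its cross-validated score
print(grid_search.best_params_)
print(grid_search.best_score_)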

Examples:

1. Support Vector Machine (SVM) Optimization: In a classification problem, using an SVM model
with C (regularization parameter) and gamma (kernel coefficient) as hyperparameters, a Grid
Search helps find the best pair for the problem. For instance, if a Grid Search is conducted over C
= [0.1, 1, 10] and gamma = [0.1, 0.01, 0.001], the method systematically tests each combination
(e.g., C=0.1, gamma=0.1, C=1, gamma=0.01, etc.) and selects the combination that achieves the
highest cross-validated performance.

2. Random Forest Parameter Tuning: In Random Forest models, parameters like the number of
trees (n_estimators), maximum depth of trees (max_depth), and the number of features to
consider for the best split (max_features) are often optimized using Grid Search.

For example:

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'max_features': ['auto', 'sqrt']
}

This grid search approach evaluates the different combinations and finds the optimal settings that improve model performance on validation data; a sketch of the full search appears after this list of examples.

3. Neural Network Hyperparameter Tuning: For more complex models like neural networks, Grid
Search can optimize hyperparameters such as learning rate, batch size, number of hidden layers,
and number of units per layer. By testing combinations like learning rate=[0.01, 0.001], batch
size=[32, 64, 128], and hidden units=[50, 100], practitioners can observe which settings yield the
most effective training and validation results.
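
A minimal sketch of the Random Forest search described in example 2, using scikit-learn's GridSearchCV; the synthetic dataset and the 5-fold setting are illustrative assumptions, and 'log2' is used in place of 'auto' because 'auto' is deprecated or removed in recent scikit-learn releases.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'max_features': ['sqrt', 'log2']
}

# 3 * 3 * 2 = 18 combinations, each evaluated with 5-fold cross-validation
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, n_jobs=-1)
grid_search.fit(X, y)
print(grid_search.best_params_)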

Grid Search used in a Classification example:

from sklearn.neighbors import KNeighborsClassifier


classifier = KNeighborsClassifier(n_neighbors=5, weights='uniform', metric='minkowski', p=2)

The K-neighbors classifier has quite a few hyperparameters that can be set for optimal performance:

 The number of neighbor points to consider in the estimate


 How to weight each of them
 What metric to use for finding the neighbors

Using a range of possible values for all the parameters yields exactly 40 combinations in this case:

import numpy as np

grid = {'n_neighbors': range(1, 11),
        'weights': ['uniform', 'distance'], 'p': [1, 2]}
print('Number of tested models: %i'
      % np.prod([len(grid[element]) for element in grid]))
score_metric = 'accuracy'

The code multiplies the number of values tested for each parameter and prints the result:

Number of tested models: 40

To set the instructions for the search, you build a Python dictionary whose keys are the names of the parameters and whose values are lists of the values to test. For instance, the example above records a range of 1 to 10 for the hyperparameter n_neighbors using the range(1, 11) iterator, which produces the sequence of numbers during the grid search.

After being instantiated with the learning algorithm, the search dictionary, the scoring metric, and the number of cross-validation folds, the GridSearchCV class operates through its fit method. Optionally, after the grid search has ended, it refits the model with the best found parameter combination (refit=True), so predictions can be made immediately through the GridSearchCV object itself. When the search is completed, it can be inspected using the best_params_ and best_score_ attributes.
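
A minimal sketch that completes the classification example above with GridSearchCV; the iris dataset and the choice of 10 cross-validation folds are illustrative assumptions.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

classifier = KNeighborsClassifier(n_neighbors=5, weights='uniform', metric='minkowski', p=2)
grid = {'n_neighbors': range(1, 11), 'weights': ['uniform', 'distance'], 'p': [1, 2]}
print('Number of tested models: %i' % np.prod([len(grid[element]) for element in grid]))

score_metric = 'accuracy'
search = GridSearchCV(estimator=classifier, param_grid=grid,
                      scoring=score_metric, cv=10, refit=True)
search.fit(X, y)

print(search.best_params_)   # best found parameter combination
print(search.best_score_)    # its cross-validated accuracy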

Grid Search used in a Regression example:

from sklearn.model_selection import train_test_split

X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,random_state=42)

from xgboost import XGBRegressor

model=XGBRegressor(random_state=42)

# Dictionary of hyperparameter values to search (the values shown here are illustrative)
search_grid = {'n_estimators': [100, 200], 'max_depth': [3, 5], 'learning_rate': [0.1, 0.01]}

from sklearn.model_selection import GridSearchCV
import pandas as pd

# GS scores each combination with 5-fold cross-validation; the 'r2' scorer produces the rank_test_r2 column used below
GS = GridSearchCV(estimator=model, param_grid=search_grid, scoring=['r2'], refit='r2', cv=5)
GS.fit(X_train, Y_train)

print(GS.best_estimator_)

df = pd.DataFrame(GS.cv_results_)
df = df.sort_values("rank_test_r2")
df.to_csv("GS_best_results.csv")

Practical Considerations:

Computational Cost: Grid Search can be computationally expensive, especially when dealing with large
datasets and complex models. To manage this, techniques like Randomized Search (which samples a
fixed number of parameter combinations) or Bayesian optimization can be used as more efficient
alternatives.

Parallel Computing: Many implementations of Grid Search, such as GridSearchCV in scikit-learn, support
parallel processing, which can help mitigate the time cost by utilizing multiple CPU cores.
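
A minimal sketch of the randomized alternative mentioned above, using scikit-learn's RandomizedSearchCV with parallel workers; the iris dataset, the sampling distributions, and n_iter=20 are illustrative assumptions.

from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Sample a fixed number of parameter combinations instead of exhausting the full grid
param_distributions = {'C': loguniform(1e-2, 1e2), 'gamma': loguniform(1e-3, 1e1)}
random_search = RandomizedSearchCV(SVC(kernel='rbf'), param_distributions,
                                   n_iter=20, cv=5, random_state=42, n_jobs=-1)  # n_jobs=-1 uses all CPU cores
random_search.fit(X, y)
print(random_search.best_params_)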
