Cross-Validation Notes
Introduction to Cross-Validation
Definition:
Cross-validation is a statistical method used to estimate the skill of machine
learning models. It involves partitioning a dataset into complementary subsets,
training the model on one subset and validating it on the other.
Importance of Cross-Validation:
Generalization: Helps ensure that the model generalizes well to unseen data.
Model Assessment: Provides a better assessment of how the model will perform
in practice.
Prevention of Overfitting: Reduces the likelihood that the model will overfit to the
training data, leading to poor performance on new data.
Overfitting vs. Underfitting
Overfitting:
Description: Occurs when a model learns not only the underlying patterns but
also the noise in the training data.
Indicators:
High accuracy on training data.
Low accuracy on validation/test data.
Visual Example: A graph showing a training curve that diverges significantly from
the validation curve.
Consequence: Model fails to perform well on new, unseen data.
Real-World Analogy: Like a student who memorizes answers without
understanding the material.
Underfitting:
Description: Happens when a model is too simple to capture the underlying trend
of the data.
Indicators:
Low accuracy on both training and validation data.
Visual Example: A graph where both training and validation accuracies are low.
Consequence: Model fails to learn from the data.
Real-World Analogy: Like a student who skims through study material, missing
important concepts.
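As a concrete sketch of both failure modes (using an Iris decision tree purely as an illustrative stand-in, not an example from these notes): an unconstrained tree will typically fit its training data perfectly yet score lower under cross-validation, while a depth-1 stump scores poorly on both.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = [
    ('Overfit-prone (unconstrained tree)', DecisionTreeClassifier(random_state=42)),
    ('Underfit-prone (depth-1 stump)', DecisionTreeClassifier(max_depth=1, random_state=42)),
]
for name, model in models:
    train_acc = model.fit(X, y).score(X, y)              # accuracy on the data it was trained on
    cv_acc = cross_val_score(model, X, y, cv=5).mean()   # accuracy averaged over held-out folds
    print(f'{name}: train={train_acc:.2f}, cross-validated={cv_acc:.2f}')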
Balancing Act:
The goal is to find the right level of complexity for the model, which may involve:
Regularization: Techniques such as Lasso or Ridge regression to penalize
overly complex models (a brief sketch follows this list).
Choosing the Right Model: Selecting a model that aligns with the
complexity of the data.
Cross-Validation: Using techniques to evaluate model performance
effectively.
Hyperparameter Tuning: Adjusting parameters to optimize model
performance.
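To make the regularization point above concrete, here is a minimal sketch (assuming synthetic data from make_regression; the alpha values are arbitrary and would normally themselves be tuned) comparing plain linear regression with Ridge and Lasso using cross-validated R² scores.

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score

# Many features but little informative signal: a setting where an
# unregularized linear model tends to overfit.
X, y = make_regression(n_samples=100, n_features=50, n_informative=10,
                       noise=10.0, random_state=42)

for name, model in [('Linear', LinearRegression()),
                    ('Ridge', Ridge(alpha=1.0)),
                    ('Lasso', Lasso(alpha=0.1))]:
    scores = cross_val_score(model, X, y, cv=5)  # default scoring for regressors is R^2
    print(f'{name}: mean R^2 = {scores.mean():.3f}')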
What is Cross-Validation?
Definition:
A technique for assessing how the results of a statistical analysis will generalize to
an independent data set. It is primarily used in settings where the goal is
prediction, and one wants to estimate how accurately a predictive model will
perform in practice.
Purpose:
Model Assessment: Provides reliable estimates of model performance on
unseen data.
Model Selection: Helps in determining the best model among several candidates.
Hyperparameter Tuning: Assists in finding the best configuration of model
parameters.
Process of k-Fold Cross-Validation:
1. Dataset Splitting: The dataset is divided into k equally sized folds.
2. Training & Validation:
For each fold, the model is trained on k-1 folds and validated on the
remaining fold.
This process is repeated k times, ensuring each fold serves as validation
exactly once.
3. Performance Measurement:
Calculate and average the performance metrics (like accuracy, F1-score)
from each iteration to obtain a more reliable estimate of the model's
performance.
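To make the splitting step concrete before the full implementation later in these notes, here is a minimal sketch showing how scikit-learn's KFold partitions ten toy samples so that each fold is held out for validation exactly once.

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # ten toy samples

kf = KFold(n_splits=5, shuffle=False)
for i, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    # Each iteration trains on k-1 folds and validates on the held-out fold.
    print(f'Fold {i}: train={train_idx.tolist()}, validation={val_idx.tolist()}')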
Benefits:
Reduced Variance: More stable and reliable performance estimates compared to
a single train/test split.
Better Data Utilization: More efficient use of available data, especially in
scenarios with limited data.
Model Robustness: Ensures that models perform well across different subsets of
data.
Types of Cross-Validation
1. k-Fold Cross-Validation:
Description: The dataset is randomly split into k equal-sized folds. Each fold is
used as a validation set while the remaining k-1 folds are used for training.
Benefit: Gives a lower-variance performance estimate than a single train/test split;
each instance is used for validation exactly once.
2. Stratified k-Fold:
Description: Similar to k-fold, but maintains the percentage of samples for each
class in each fold. This is especially important for imbalanced datasets.
Benefit: Preserves class distribution, leading to better performance estimates for
classification tasks.
3. Leave-One-Out Cross-Validation (LOOCV):
Description: A special case of k-fold cross-validation where k equals the number
of instances in the dataset. Each instance is used once as a validation set.
Benefit: Provides a thorough assessment but can be computationally expensive
for large datasets.
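LOOCV is not demonstrated in the practical section below, so here is a minimal sketch (reusing the Iris data and Logistic Regression from later in these notes) with scikit-learn's LeaveOneOut; note that it refits the model once per sample, which is why it becomes expensive on large datasets.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

# One fold per sample: 150 fits for the 150 Iris instances.
scores = cross_val_score(LogisticRegression(max_iter=200), X, y, cv=LeaveOneOut())
print(f'LOOCV mean accuracy: {scores.mean():.3f}')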
4. Time Series Cross-Validation:
Description: A technique specifically designed for time series data where the
training set must precede the validation set in time.
Benefit: Preserves the temporal order of data, making it appropriate for
forecasting tasks.
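As a minimal sketch of this idea, scikit-learn's TimeSeriesSplit produces splits in which the training indices always precede the validation indices (the twelve time-ordered observations here are just placeholders).

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered observations

tscv = TimeSeriesSplit(n_splits=3)
for i, (train_idx, val_idx) in enumerate(tscv.split(X), start=1):
    # Each training window ends before its validation window begins.
    print(f'Split {i}: train={train_idx.tolist()}, validation={val_idx.tolist()}')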
5. Group k-Fold:
Description: Ensures that the same group is not represented in both training and
validation sets. Useful in cases where the data is grouped (e.g., multiple
measurements from the same subjects).
Benefit: Prevents data leakage from related observations.
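A minimal sketch with scikit-learn's GroupKFold, using a made-up groups array (for example, a subject ID per measurement) to keep every measurement from a given subject on one side of the split.

import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(8).reshape(-1, 1)
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])  # hypothetical subject IDs

gkf = GroupKFold(n_splits=4)
for train_idx, val_idx in gkf.split(X, y, groups=groups):
    # No group appears in both the training and validation sets.
    print(f'train groups={np.unique(groups[train_idx]).tolist()}, '
          f'validation groups={np.unique(groups[val_idx]).tolist()}')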
Practical Implementation of Cross-Validation
Introduction:
"Now that we’ve discussed the theory and importance of cross-validation, let’s move on to
the practical side—implementing cross-validation in Python. Python offers robust libraries
like Scikit-learn that make it easy to perform cross-validation and evaluate your machine
learning models."
1. Setting Up the Environment:
"First, let’s ensure we have the necessary libraries installed."
Installing Libraries:
“You will need the following libraries: NumPy for numerical operations, Pandas for data
manipulation, Matplotlib for visualization, and Scikit-learn for machine learning. You can
install these libraries using pip if you haven’t done so already.”
pip install numpy pandas matplotlib scikit-learn
2. Loading the Dataset:
"Next, let’s load a dataset to work with."
Using an Example Dataset:
“For demonstration purposes, we’ll use the popular Iris dataset, which is readily
available in Scikit-learn. This dataset consists of 150 samples of iris flowers, with four
features for each sample.”
from sklearn.datasets import load_iris
import pandas as pd
# Load the iris dataset
iris = load_iris()
X = iris.data # Features
y = iris.target # Target variable
3. Implementing k-Fold Cross-Validation:
"Let’s dive into k-Fold cross-validation now."
Importing Necessary Functions:
“We’ll import the KFold class from Scikit-learn, as well as a classifier like
LogisticRegression to fit our model.”
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
Setting Up k-Fold:
“Next, we’ll set up our k-Fold cross-validation. Let’s say we want to use 5 folds.”
kf = KFold(n_splits=5, shuffle=True, random_state=42)
4. Looping Through the Folds:
"Now, let’s loop through the folds and evaluate our model."
Fitting the Model:
“We will fit our Logistic Regression model on the training set of each fold and evaluate it
on the test set.”
accuracies = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model = LogisticRegression(max_iter=200)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    accuracies.append(accuracy)

print(f'Accuracies for each fold: {accuracies}')
print(f'Mean accuracy: {sum(accuracies) / len(accuracies)}')
5. Using Stratified k-Fold:
"If we are dealing with classification problems, it’s wise to consider using Stratified k-Fold."
Implementation of Stratified k-Fold:
“Here’s how you can implement Stratified k-Fold in the same way.”
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
stratified_accuracies = []

for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model = LogisticRegression(max_iter=200)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    stratified_accuracies.append(accuracy)

print(f'Stratified Accuracies for each fold: {stratified_accuracies}')
print(f'Mean stratified accuracy: {sum(stratified_accuracies) / len(stratified_accuracies)}')
6. Visualizing the Results:
"Lastly, let’s visualize the performance across the folds."
Plotting Accuracies:
“Visualizing the accuracies can provide insight into the model's consistency across
folds. Here’s how you can plot the accuracies using Matplotlib.”
import matplotlib.pyplot as plt

plt.plot(range(1, 6), accuracies, marker='o', label='k-Fold Accuracies')
plt.plot(range(1, 6), stratified_accuracies, marker='x', label='Stratified k-Fold Accuracies')
plt.title('Cross-Validation Accuracies')
plt.xlabel('Fold Number')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
(Discuss the importance of visualizing model performance and how it can help diagnose
potential issues.)
Libraries and Tools:
Python's scikit-learn: Offers easy-to-use functions for implementing various
cross-validation techniques.
Example Code Snippet:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
# Load dataset
X, y = load_iris(return_X_y=True)
# Initialize model
model = RandomForestClassifier(n_estimators=100)
# Perform 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5)
# Output the average score
print("Average Score:", scores.mean())
Key Considerations:
Choosing the Right Method: Select the appropriate cross-validation technique
based on dataset size, structure, and problem type.
Data Leakage Prevention: Ensure that no information from the validation folds
leaks into training (for example, via preprocessing fit on the full dataset); see the
pipeline sketch after this list.
Computational Cost: Be aware of the computational load, especially with
LOOCV or large datasets.
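One common source of leakage is fitting a preprocessing step (such as a scaler) on the full dataset before splitting. Here is a minimal sketch of the usual safeguard, wrapping preprocessing and the model in a scikit-learn Pipeline so the scaler is re-fit on the training folds only inside each split.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The scaler is fit within each training fold, so no statistics from the
# validation fold leak into preprocessing.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
scores = cross_val_score(pipeline, X, y, cv=5)
print(f'Mean accuracy with a leakage-safe pipeline: {scores.mean():.3f}')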
Conclusion and Best Practices
Revisit Key Concepts:
Importance of balancing overfitting and underfitting.
Utilizing cross-validation to improve model performance.
Best Practices:
Always validate your model with cross-validation, especially when tuning
hyperparameters (see the GridSearchCV sketch after this list).
Use stratified sampling for classification tasks to ensure a representative sample.
Monitor and interpret performance metrics carefully to guide model adjustments.
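Putting the first two best practices together, here is a minimal sketch of hyperparameter tuning with GridSearchCV (the SVC parameter grid is just an illustrative choice); passing an integer cv for a classifier makes scikit-learn use stratified folds internally.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# cv=5 with a classifier uses StratifiedKFold under the hood.
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print('Best parameters:', search.best_params_)
print(f'Best cross-validated accuracy: {search.best_score_:.3f}')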
Encourage Practice:
Engage students in practical exercises to apply cross-validation methods on
different datasets.
Discuss case studies where cross-validation significantly improved model
performance.