18.15 - Visualizing Train, Validation and Test Datasets - mp4

svm

Uploaded by

NAKKA PUNEETH
So let's understand how to visualize the train, cross-validation, and test datasets. This is very important. Imagine we have a big dataset Dn, which we split randomly into three parts: D_train, D_cv (cross-validation), and D_test. Just for simplicity, let's say 60% of the data goes into D_train, 20% into D_cv, and the remaining 20% into D_test.

A simple cross will denote a negative-class point in D_train, and a blue cross a positive-class point in D_train. The first thing to notice is that whether a given point lands in D_train, D_cv, or D_test, we have both the data point and its class label, the pair (xi, yi). We have these pairs for all three sets because everything comes from Dn, and in Dn we have the (xi, yi) pairs.

Let's assume the data is two-dimensional so that it's easy to visualize. This is a plot of just the training data. Remember, the data was randomly sampled from Dn to create these sets. First let's focus on just the training and cross-validation data; extending the argument to the test data afterwards is a very simple step.
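The random 60/20/20 split described above can be sketched in a few lines. This is a minimal NumPy version; the function name, the seed, and the toy dataset are my own illustration, not from the lecture:

```python
import numpy as np

def split_60_20_20(X, y, seed=0):
    """Randomly split (X, y) into 60% train, 20% cross-validation, 20% test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # random shuffle of row indices
    n_train = int(0.6 * len(X))
    n_cv = int(0.2 * len(X))
    tr, cv, te = np.split(idx, [n_train, n_train + n_cv])
    # every point keeps its (xi, yi) pair, whichever set it lands in
    return (X[tr], y[tr]), (X[cv], y[cv]), (X[te], y[te])

# toy 2-D dataset standing in for Dn
X = np.random.default_rng(42).normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
train, cv, test = split_60_20_20(X, y)
print(len(train[0]), len(cv[0]), len(test[0]))  # 60 20 20
```

Because the indices are shuffled before slicing, each of the three sets is a random sample of Dn, which is exactly the property the rest of this discussion relies on.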
When I draw a cross with a circle around it, it means a negative point from D_cv; similarly, a blue cross with a circle around it means a positive point from D_cv. So a simple cross without a surrounding circle is from D_train, and a cross with a surrounding circle is from D_cv.

Looking at just the D_train data, one thing you'll notice quickly: in the orange region there are a lot of negative points, in the blue region a lot of positive points, and only the odd stray point of the opposite class in the training data.

Now, if I overlay the D_cv points on the D_train points, they will not overlap exactly, because D_cv is a random sample; it's not the same data. The bigger dataset Dn has points in all regions; we randomly picked 60% of them for D_train and another 20% for D_cv. So the D_cv negative points will mostly land where D_train's negative points are, and the D_cv positive points will mostly land where D_train's positive points are, because of the random sampling. But just as we found a stray negative point in D_train, you could also find a stray negative point from D_cv sitting in the positive region.
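The overlay described above can be reproduced with a quick matplotlib sketch. The Gaussian blobs below are made-up stand-ins for the lecture's orange and blue regions, and the circled-cross convention for D_cv points follows the lecture's drawing:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                        # headless backend; save to file
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# synthetic blobs: negatives around (-2, 0), positives around (+2, 0)
neg = rng.normal(loc=[-2, 0], scale=1.0, size=(80, 2))
pos = rng.normal(loc=[+2, 0], scale=1.0, size=(80, 2))
X = np.vstack([neg, pos])
y = np.array([0] * 80 + [1] * 80)

idx = rng.permutation(len(X))
train_idx, cv_idx = idx[:96], idx[96:128]    # 60% train, 20% cross-validation

fig, ax = plt.subplots()
for subset, name, circled in [(train_idx, "train", False), (cv_idx, "cv", True)]:
    for label, color in [(0, "red"), (1, "blue")]:
        pts = X[subset][y[subset] == label]
        ax.scatter(pts[:, 0], pts[:, 1], marker="x", c=color,
                   label=f"D_{name} {'positive' if label else 'negative'}")
        if circled:  # D_cv points get a surrounding circle, as in the lecture
            ax.scatter(pts[:, 0], pts[:, 1], s=120, facecolors="none",
                       edgecolors=color)
ax.legend()
fig.savefig("train_cv_overlay.png")
```

In the saved figure, the circled D_cv crosses fall mostly on top of the dense D_train clusters of the same color, with an occasional stray point in the wrong blob, exactly the behavior described here.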
One random point can land anywhere. So if you look at this data now, you'll quickly realize we have some outliers: a negative D_train point sitting in the blue region, a positive D_train point sitting in the wrong region, and a few similar errors. This is typical of a random split. Wherever there is a good amount of D_train data of one class, you'll also find a lot of D_cv points of that class; but you'll always find a few crazy points, like a positive D_cv point sitting in the red region. Especially when you break your data randomly, expect to see some of this behavior.

Geometrically, there are two things to keep in mind. Number one, D_train and D_cv do not overlap perfectly, because they are random samples; they need not overlap perfectly. Number two, if there are many positive points from D_train in a region, it is highly likely that you will find many positive points from D_cv in that region too, and the same holds for negative points. Conversely, if there are very few positive (or negative) points from D_train in a region, it is very unlikely that you will find positive (or negative, respectively) points from D_cv there. Such points are called noisy points, erroneous points, or outliers.

Whatever we have said for D_train and D_cv also holds for D_train and D_test: as long as you're breaking your data randomly, the same behavior occurs. Intuitively, it is a density argument. If the density of, say, positive points from D_train in a region is high, you are likely to find positive points from D_cv in that region as well. On the other hand, take a region with only one or two positive points, a very low density: there you may not find any positive D_cv points at all. You might instead find negative points there, and if that happens, it simply means the density of D_train's negative points in that region is high, so negative D_cv points are the more likely find.

So always remember: when you break your data randomly into D_train, D_cv, and D_test, don't expect them to be exactly the same. There will always be these outliers, and they can create havoc.
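The density argument above can be checked numerically: count, for one region, how many points of a class fall there from D_train versus D_cv. This is a small sketch with a made-up Gaussian dataset; the region center, radius, and seed are illustrative choices, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(1)
# negatives cluster around (-2, 0); positives around (+2, 0)
neg = rng.normal([-2, 0], 1.0, size=(200, 2))
pos = rng.normal([+2, 0], 1.0, size=(200, 2))
X = np.vstack([neg, pos])
y = np.array([0] * 200 + [1] * 200)

idx = rng.permutation(400)
tr, cv = idx[:240], idx[240:320]          # 60% / 20% random split

def count_in_region(points, center, r=1.0):
    """Number of points within distance r of a region center."""
    return int((np.linalg.norm(points - center, axis=1) < r).sum())

center = np.array([-2.0, 0.0])            # heart of the negative cluster
tr_neg = count_in_region(X[tr][y[tr] == 0], center)
cv_neg = count_in_region(X[cv][y[cv] == 0], center)
print(tr_neg, cv_neg)
```

Since D_train holds three times as many points as D_cv, the train count comes out roughly three times larger, but both are comfortably nonzero: a region dense in D_train negatives is also dense in D_cv negatives, which is the whole intuition.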
