18.15 - Visualizing Train, Validation and Test Datasets - mp4

svm

Uploaded by

NAKKA PUNEETH
So let's understand how to visualize the train, cross-validation, and test datasets. This is very important. Imagine we have a big dataset Dn, which we split randomly into three parts: D_train, D_cv (cross-validation), and D_test. Just for simplicity, let's say 60% of the data goes into D_train, 20% into D_cv, and the remaining 20% into D_test.

A simple cross will denote a negative-class point in D_train, and a blue cross a positive-class point in D_train. The first thing to notice is that whether a given point lands in D_train, D_cv, or D_test, we have both the data point and its class label, the pair (xi, yi). We have these pairs for all three sets because everything comes from Dn, and in Dn we have the (xi, yi) pairs.

Let's assume the data is two-dimensional so that it's easy to visualize. This is a plot of just the training data. Remember, the data was randomly sampled from Dn to create these sets. First let's focus on just the training and cross-validation data; extending the argument to the test data afterwards is a very simple step.
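The random 60/20/20 split described above can be sketched in a few lines. This is a minimal NumPy version; the function name, the seed, and the toy dataset are my own illustration, not from the lecture:

```python
import numpy as np

def split_60_20_20(X, y, seed=0):
    """Randomly split (X, y) into 60% train, 20% cross-validation, 20% test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # random shuffle of row indices
    n_train = int(0.6 * len(X))
    n_cv = int(0.2 * len(X))
    tr, cv, te = np.split(idx, [n_train, n_train + n_cv])
    # every point keeps its (xi, yi) pair, whichever set it lands in
    return (X[tr], y[tr]), (X[cv], y[cv]), (X[te], y[te])

# toy 2-D dataset standing in for Dn
X = np.random.default_rng(42).normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
train, cv, test = split_60_20_20(X, y)
print(len(train[0]), len(cv[0]), len(test[0]))  # 60 20 20
```

Because the indices are shuffled before slicing, each of the three sets is a random sample of Dn, which is exactly the property the rest of this discussion relies on.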
When I draw a cross with a circle around it, it means a negative point from D_cv; similarly, a blue cross with a circle around it means a positive point from D_cv. So a simple cross without a surrounding circle is from D_train, and a cross with a surrounding circle is from D_cv.

Looking at just the D_train data, one thing you'll notice quickly: in the orange region there are a lot of negative points, in the blue region a lot of positive points, and only the odd stray point of the opposite class in the training data.

Now, if I overlay the D_cv points on the D_train points, they will not overlap exactly, because D_cv is a random sample; it's not the same data. The bigger dataset Dn has points in all regions; we randomly picked 60% of them for D_train and another 20% for D_cv. So the D_cv negative points will mostly land where D_train's negative points are, and the D_cv positive points will mostly land where D_train's positive points are, because of the random sampling. But just as we found a stray negative point in D_train, you could also find a stray negative point from D_cv sitting in the positive region.
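The overlay described above can be reproduced with a quick matplotlib sketch. The Gaussian blobs below are made-up stand-ins for the lecture's orange and blue regions, and the circled-cross convention for D_cv points follows the lecture's drawing:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                        # headless backend; save to file
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# synthetic blobs: negatives around (-2, 0), positives around (+2, 0)
neg = rng.normal(loc=[-2, 0], scale=1.0, size=(80, 2))
pos = rng.normal(loc=[+2, 0], scale=1.0, size=(80, 2))
X = np.vstack([neg, pos])
y = np.array([0] * 80 + [1] * 80)

idx = rng.permutation(len(X))
train_idx, cv_idx = idx[:96], idx[96:128]    # 60% train, 20% cross-validation

fig, ax = plt.subplots()
for subset, name, circled in [(train_idx, "train", False), (cv_idx, "cv", True)]:
    for label, color in [(0, "red"), (1, "blue")]:
        pts = X[subset][y[subset] == label]
        ax.scatter(pts[:, 0], pts[:, 1], marker="x", c=color,
                   label=f"D_{name} {'positive' if label else 'negative'}")
        if circled:  # D_cv points get a surrounding circle, as in the lecture
            ax.scatter(pts[:, 0], pts[:, 1], s=120, facecolors="none",
                       edgecolors=color)
ax.legend()
fig.savefig("train_cv_overlay.png")
```

In the saved figure, the circled D_cv crosses fall mostly on top of the dense D_train clusters of the same color, with an occasional stray point in the wrong blob, exactly the behavior described here.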
One random point can land anywhere. So if you look at this data now, you'll quickly realize we have some outliers: a negative D_train point sitting in the blue region, a positive D_train point sitting in the wrong region, and a few similar errors. This is typical of a random split. Wherever there is a good amount of D_train data of one class, you'll also find a lot of D_cv points of that class; but you'll always find a few crazy points, like a positive D_cv point sitting in the red region. Especially when you break your data randomly, expect to see some of this behavior.

Geometrically, there are two things to keep in mind. Number one, D_train and D_cv do not overlap perfectly, because they are random samples; they need not overlap perfectly. Number two, if there are many positive points from D_train in a region, it is highly likely that you will find many positive points from D_cv in that region too, and the same holds for negative points. Conversely, if there are very few positive (or negative) points from D_train in a region, it is very unlikely that you will find positive (or negative, respectively) points from D_cv there. Such points are called noisy points, erroneous points, or outliers.

Whatever we have said for D_train and D_cv also holds for D_train and D_test: as long as you're breaking your data randomly, the same behavior occurs. Intuitively, it is a density argument. If the density of, say, positive points from D_train in a region is high, you are likely to find positive points from D_cv in that region as well. On the other hand, take a region with only one or two positive points, a very low density: there you may not find any positive D_cv points at all. You might instead find negative points there, and if that happens, it simply means the density of D_train's negative points in that region is high, so negative D_cv points are the more likely find.

So always remember: when you break your data randomly into D_train, D_cv, and D_test, don't expect them to be exactly the same. There will always be these outliers, and they can create havoc.
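The density argument above can be checked numerically: count, for one region, how many points of a class fall there from D_train versus D_cv. This is a small sketch with a made-up Gaussian dataset; the region center, radius, and seed are illustrative choices, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(1)
# negatives cluster around (-2, 0); positives around (+2, 0)
neg = rng.normal([-2, 0], 1.0, size=(200, 2))
pos = rng.normal([+2, 0], 1.0, size=(200, 2))
X = np.vstack([neg, pos])
y = np.array([0] * 200 + [1] * 200)

idx = rng.permutation(400)
tr, cv = idx[:240], idx[240:320]          # 60% / 20% random split

def count_in_region(points, center, r=1.0):
    """Number of points within distance r of a region center."""
    return int((np.linalg.norm(points - center, axis=1) < r).sum())

center = np.array([-2.0, 0.0])            # heart of the negative cluster
tr_neg = count_in_region(X[tr][y[tr] == 0], center)
cv_neg = count_in_region(X[cv][y[cv] == 0], center)
print(tr_neg, cv_neg)
```

Since D_train holds three times as many points as D_cv, the train count comes out roughly three times larger, but both are comfortably nonzero: a region dense in D_train negatives is also dense in D_cv negatives, which is the whole intuition.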
