Lecture - 29
Big Data Predictive Analytics
(Part-II)
Gradient boosted decision trees for regression.
Boosting: Boosting is a method of combining the outputs of many weak classifiers to produce a powerful ensemble. There are several variants of boosting algorithms: AdaBoost, BrownBoost, LogitBoost and Gradient Boosting.
So, how to build such an f(x)? In boosting, our goal is to build the function f(x) iteratively. We suppose that this function is f(x) = Σ_{m=1}^{M} h_m(x), and in particular let us assume that each function h_m is a decision tree. So, here each function is a decision tree, and the aggregation of them is the f(x) function.
Let us understand the gradient boosted decision trees for regression in more detail. So, gradient boosted trees for the regression problem. Let us take as input a training dataset of n different samples, where x_i is the set of features and y_i is the label, so the training set {(x_1, y_1), ..., (x_n, y_n)} is given as input to the gradient boosted trees for regression, and M is the number of iterations. Now, step number one calculates the initial model f_0(x): it sums up all the label values and takes their mean, so f_0(x) = (1/n) Σ_{i=1}^{n} y_i, the mean of the labels from the training dataset, becomes the initial f_0 function. Then, it iterates for m = 1, ..., M. In each iteration it calculates the residuals, ŷ_i = y_i − f_{m−1}(x_i), that is, the differences between the actual labels and the values predicted by the previous iteration's model; these differences are called the 'Residuals'. The next step is to fit a decision tree h_m to the targets ŷ_i. That means it generates an auxiliary training set {(x_i, ŷ_i)} out of the original set, in which the labels are replaced by the residuals. Then, f_m(x) = f_{m−1}(x) + ν h_m(x) is calculated, where ν is the regularization coefficient, and in this manner it iterates and computes all these things. As far as the regularization coefficient is concerned, it is recommended that ν ≤ 0.1; this is going to be an important parameter in gradient boosted trees for regression.
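To make these steps concrete, here is a minimal sketch of the procedure in Python, using scikit-learn decision trees rather than the Spark implementation discussed later; the function and parameter names (gradient_boosted_regression, learning_rate, n_iterations) are illustrative choices, not part of any library.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boosted_regression(X, y, n_iterations=100, learning_rate=0.1, max_depth=3):
    # Step 1: f0(x) is the mean of the training labels.
    f0 = float(np.mean(y))
    prediction = np.full(len(y), f0)
    trees = []
    for m in range(n_iterations):
        # Step 2: residuals with respect to the previous iteration's model.
        residuals = y - prediction
        # Step 3: fit a decision tree h_m to the residuals (the auxiliary training set).
        h_m = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        trees.append(h_m)
        # Step 4: f_m(x) = f_{m-1}(x) + v * h_m(x), with v the regularization coefficient.
        prediction = prediction + learning_rate * h_m.predict(X)
    return f0, trees

def predict(X, f0, trees, learning_rate=0.1):
    # Sum the initial constant and the scaled contributions of all trees.
    out = np.full(X.shape[0], f0)
    for h_m in trees:
        out = out + learning_rate * h_m.predict(X)
    return out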
Let us assume that the training set which is given is Z = {(x_1, y_1), ..., (x_n, y_n)}, where the x_i are the features and the y_i are the class labels. So, here there will be a class label y because this is a classification problem, and the classes lie in {0, 1}. So, the class labels, let us assume, are 0 and 1, and these will be assigned as the labels in the training dataset. Now, the goal here is to find f(x) using the training dataset such that it minimizes the misclassification count Σ_{(x,y)∈T} [f(x) ≠ y]. That means if the predicted label f(x) does not match the target label y, it counts as an error, and the sum of all such misclassifications is to be minimized. So, f(x) should be the function which achieves this, so that the prediction can be applied on the test dataset T = {(x_1, y_1), ..., (x_n, y_n)}. So, how to build this f(x)? That is the problem of gradient boosted decision trees.
P(y = 1 | x) = 1 / (1 + exp(−Σ_{m=1}^{M} h_m(x))), where each h_m(x) is a decision tree, and therefore the value satisfies 0 < P(y = 1 | x) < 1. So, we model the probability of an object belonging to the first class, and here inside the exponential there is a sum of h_m's, and each h_m(x) is a decision tree. So, we can easily check that this expression for the probability will always be between 0 and 1; it is a normal, regular probability.
And now, we have to calculate the likelihood function. By the principle of maximum likelihood, the algorithm has to find a function f(x) which maximizes the likelihood, which is equivalent to finding the f(x) maximizing the logarithm of the likelihood, since the logarithm is a monotone function. So that can be represented here in this case as
Q(f) = Σ_{i=1}^{n} log P(y_i | x_i),
and we have to find max_f Q(f). That is, we have to find the f(x) which will maximize the likelihood.
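For the binary labels 0 and 1 used here, this log-likelihood can be written out term by term; in standard LaTeX notation, with p_i the modelled probability of class 1 for the i-th object:

Q(f) = \sum_{i=1}^{n} \log P(y_i \mid x_i)
     = \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right],
\qquad p_i = \frac{1}{1 + \exp(-f(x_i))}.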
Refer Slide Time: (14:45)
Hence, we have to fit our distribution to this dataset by way of the principle of maximum likelihood. So, we will denote by Q(f) the logarithm of the likelihood; it is the sum of the logarithms of the probabilities, and we are going to maximize this particular function. Now, we will use a shorthand for this logarithm, L(y_i, f(x_i)) = log P(y_i | x_i). And here we emphasize that this logarithm depends on the true label y_i and on our predicted value f(x_i). And now,
Q(f) = Σ_{i=1}^{n} L(y_i, f(x_i)).
So, y_i is the true label for the i-th sample, f(x_i) is the predicted value, the logarithm of the probability is represented by L(y_i, f(x_i)), and the summation of these terms is the log-likelihood Q(f) which has to be maximized.
The initial model is f_0(x) = log(p_1 / (1 − p_1)), where p_1 is the proportion of objects of class 1. Then, iterating for m = 1 to M, we will first find the gradient
g_i = ∂L(y_i, f(x_i)) / ∂f(x_i), evaluated at f = f_{m−1}.
So, this derivative with respect to f_{m−1}(x_i) is called the 'Gradient'. The gradients are calculated, and then a decision tree h_m(x) is fitted to the targets g_i. So, the auxiliary training dataset which will be used here is {(x_i, g_i)}: the labels are replaced by the gradients, and this is called the 'Auxiliary dataset'. Given the auxiliary dataset, it will fit the decision tree h_m(x_i). Then the step size ρ_m will be calculated, and
f_m(x) = f_{m−1}(x) + ν ρ_m h_m(x),
where ν is the regularization coefficient. So, this process will in turn give the ensemble of trees, and it will do the classification in this particular manner.
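As before, a minimal Python sketch of these classification steps, with scikit-learn trees standing in for the Spark implementation; for simplicity the step size ρ_m is fixed to 1 here, so only the regularization coefficient ν (learning_rate) scales each tree, and all names are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_boosted_classification(X, y, n_iterations=100, learning_rate=0.1, max_depth=3):
    # f0(x) = log(p1 / (1 - p1)), where p1 is the fraction of class-1 objects.
    p1 = float(np.mean(y))
    f0 = np.log(p1 / (1.0 - p1))
    f = np.full(len(y), f0)
    trees = []
    for m in range(n_iterations):
        # Gradient of the log-likelihood with respect to f(x_i): g_i = y_i - P(y=1|x_i).
        g = y - sigmoid(f)
        # Fit a tree h_m to the targets g_i (the auxiliary dataset) and update the model.
        h_m = DecisionTreeRegressor(max_depth=max_depth).fit(X, g)
        trees.append(h_m)
        f = f + learning_rate * h_m.predict(X)   # rho_m taken as 1 in this sketch
    return f0, trees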
So, let us see stochastic boosting, that is, gradient boosted trees for classification when we use stochastic boosting.
Refer Slide Time: (19:49)
Then it will be represented here in this case: the subset which is used as the training set for each tree, of size k = 0.5n, will be created by random sampling with replacement. So, in this particular part we are going to use the concept from the random forest, the bootstrap generation of the dataset. So, this is gradient boosted trees plus stochastic boosting, shown over here. So, sampling with replacement is used, but here the value of k will be half of the size of the total training set. So here, half of the samples will be taken into consideration.
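A small self-contained illustration of the sampling step described above, following the lecture's variant where k = 0.5n examples are drawn with replacement before each tree is fitted; the toy arrays are only for demonstration.

import numpy as np

rng = np.random.default_rng(seed=42)
X = np.arange(20).reshape(10, 2)          # toy feature matrix, n = 10 samples
residuals = np.linspace(-1.0, 1.0, 10)    # toy targets for the current iteration
k = len(X) // 2                           # k = 0.5 * n
idx = rng.choice(len(X), size=k, replace=True)
X_sub, r_sub = X[idx], residuals[idx]     # the next tree h_m would be fitted on (X_sub, r_sub)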
So, let us see the tips for usage. First of all, it is important to understand how the regularization parameter works. In this figure, we can see the behavior of the gradient boosted decision trees with two variants of this parameter, 0.1 and 0.5. What happens here is that at the initial stage of the learning one variant may look better, but the variant with 0.1 ends up with the lower testing error.
So, the recommended learning rate should be less than or equal to 0.1, as we have already seen. And the bigger your dataset is, the larger the number of iterations should be; the recommended number of iterations ranges from several hundred to several thousand. Also, the more features you have in your dataset, the deeper your decision trees should be. These are general rules: the bigger your dataset is and the more features you have, the more complex a model you can build without overfitting in this particular scenario.
So, to summarize, it is one of the best methods for general-purpose classification and regression problems. It automatically handles interactions of features because, at its core, it is based on decision trees, which can combine several features in a single tree. Also, this algorithm is computationally scalable: it can be effectively executed in a distributed environment, for example in Spark. So, it can be executed on top of a Spark cluster.
Spark ML based decision trees and examples: we now have to see the programming aspect of it, how using Spark ML we will use the decision tree and decision tree ensembles. First, we have to see how the decision tree will be implemented over Spark ML, and then we can extend it to the random forest and gradient boosted trees. So, the first step here is to create the Spark context and the Spark session, which is shown here in these particular steps: we have to create the Spark context and we also have to create the Spark session.
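A sketch of this first step in PySpark is shown below; the application name is an arbitrary choice.

from pyspark.sql import SparkSession

# Create the Spark session; the Spark context is then available as spark.sparkContext.
spark = SparkSession.builder \
    .appName("DecisionTreeExample") \
    .getOrCreate()
sc = spark.sparkContext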
Then we read the dataset into data frames, and we can see its contents: the data has an ID number, a label, that is a categorical label, which can be denoted by 0 and 1, or you can see that it is M and B, which will finally be converted into 1 and 0, and it has the feature columns. So, this particular dataset has the features, it has the labels and it has an ID; these are the three important parts. So, let us take this particular case, where all these important parts are there in the dataset, and now we can further look into the dataset.
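A possible way to load such a dataset into a data frame is sketched below; the file name and CSV options are assumptions, not taken from the lecture.

# Read the dataset (id, categorical label M/B, feature columns) into a DataFrame.
df = spark.read.csv("cancer_dataset.csv", header=True, inferSchema=True)
df.printSchema()
df.show(5)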
Now, we need to transform this label, which is given as M or B, into a numeric value, and using a StringIndexer object we can do this: the StringIndexer object will convert these labels into label values. So, all these labels will be converted into numeric values using the StringIndexer.
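A sketch of this step with PySpark's StringIndexer; the column names 'label' and 'labelIndexed' are assumptions.

from pyspark.ml.feature import StringIndexer

# Map the categorical label (M/B) to numeric values 0/1.
label_indexer = StringIndexer(inputCol="label", outputCol="labelIndexed")
indexed_df = label_indexer.fit(df).transform(df)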
Now, we will make the training/test split in the proportion 70 to 30, that is, 70% is the training portion and 30% will be the test portion of the dataset. So, first we use the 70% of the dataset to train the model, a single decision tree, and then we will use the 30% of the dataset for testing purposes. This is done in this particular manner using randomSplit with 70/30, and then we train the decision tree model using DecisionTreeClassifier, applying the label indexer in this particular case, and then we fit the decision tree on the training dataset. So, this will build a model called the 'Decision Tree Model', represented by the dtModel variable. Now, you can see how many nodes this particular model has, what its depth is, what the important features are, and so on.
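These steps might look as follows in PySpark; this assumes the feature columns have already been assembled into a single vector column named 'features' (for example with a VectorAssembler), and the other column names follow the earlier assumptions.

from pyspark.ml.classification import DecisionTreeClassifier

# 70/30 train/test split, then fit a single decision tree.
train_df, test_df = indexed_df.randomSplit([0.7, 0.3], seed=42)
dt = DecisionTreeClassifier(labelCol="labelIndexed", featuresCol="features")
dt_model = dt.fit(train_df)

# Inspect the trained model: number of nodes, depth and feature importances.
print(dt_model.numNodes)
print(dt_model.depth)
print(dt_model.featureImportances)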
And so, after importing DecisionTreeClassifier, we create the class that is responsible for the training, and then we call the fit method on the training data and obtain the decision tree model that we have shown. Now, the training was done very fast, and we have to see the results: the number of nodes and the depth of the decision tree, the feature importances, the number of features used in the decision tree, and so on. These different things we can inspect.
Refer Slide Time: (31:01)
And now, we are going to look at the decision tree itself: we can print this decision tree model and see that it is nothing but if-then-else rules, that means all the internal nodes of the decision tree are conditions, and finally the leaves carry the predictions. So, once the decision tree is built, we can use this decision tree on the test data to obtain the predictions. So, now we take the test data, use the same decision tree model and perform the predictions. So, the predictions are now there, and then we will evaluate these particular predictions using the multi-class classification evaluator; we will evaluate the predictions, which are made by this one decision tree, for their accuracy, and see what we find here.
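A sketch of printing the tree and evaluating it on the held-out data, continuing the assumed column names from above:

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Print the learned tree as nested if-else rules.
print(dt_model.toDebugString)

# Score the test data and measure accuracy.
predictions = dt_model.transform(test_df)
evaluator = MulticlassClassificationEvaluator(
    labelCol="labelIndexed", predictionCol="prediction", metricName="accuracy")
print("Decision tree test accuracy:", evaluator.evaluate(predictions))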
And now, we are going to see another method, which is called 'Gradient Boosted Decision Trees'. So, first of all we import the classifier and then create the object which will do the classification. And here, we are specifying the label column as the indexed label that we have seen in the previous example, and the features column as the features. Actually, it is not mandatory to specify the features column; we can do it either with this argument or without it. Now, after having done this, we will fit this particular gradient boosted classifier, using the indexed label and the features that we have seen, onto the training dataset, and we will get the gradient boosted decision tree model. So, once we obtain the model, we will apply the test data to this particular model, and using the multi-class classification evaluator we will evaluate its predictions. So, let us see: we are going to do 100 iterations, and the default step size is 0.1. After that, we will see the accuracy; the test error here is quite small, so basically this has improved over the decision tree model.
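A corresponding sketch with PySpark's GBTClassifier, using 100 iterations and step size 0.1 as mentioned above; the column names remain the earlier assumptions.

from pyspark.ml.classification import GBTClassifier

# Gradient boosted trees: 100 iterations, step size (learning rate) 0.1.
gbt = GBTClassifier(labelCol="labelIndexed", featuresCol="features",
                    maxIter=100, stepSize=0.1)
gbt_model = gbt.fit(train_df)
gbt_predictions = gbt_model.transform(test_df)
print("GBT test accuracy:", evaluator.evaluate(gbt_predictions))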
Now, we are going to see the random forest. For the random forest, we will take the same dataset again, but here we change the classifier to the random forest classifier; we will fit the training data with the random forest classifier and we will get a model, which is called the 'Random Forest Model'. Now, using this particular model, we will transform the test data, that is, we will apply the test data to this particular model and get the predictions. So, after getting the predictions, we are now evaluating these particular predictions using the multi-class classification evaluator, which will evaluate the prediction accuracy.
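And a similar sketch with the RandomForestClassifier; the number of trees here is an illustrative choice.

from pyspark.ml.classification import RandomForestClassifier

# Random forest on the same split.
rf = RandomForestClassifier(labelCol="labelIndexed", featuresCol="features", numTrees=100)
rf_model = rf.fit(train_df)
rf_predictions = rf_model.transform(test_df)
print("Random forest test accuracy:", evaluator.evaluate(rf_predictions))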
And we will now be able to see that the test error is much smaller than for the previous decision trees. Hence, we can see that in this example the testing accuracy of the random forest was the best; with other data the situation may be quite different. In the general case, the bigger your dataset is and the more features it has, the better the quality of complex algorithms like gradient boosting or the random forest will be.
Then we will build the Spark session, which will create the data frames; then we will use the dataset, read the dataset, load the data file, and create the data frame DF as the input data frame. So, then we will apply cross validation using the parameter grid builder, and we also have to see the cross validator.
Refer Slide Time: (37:39)
So, then we import the CrossValidator and ParamGridBuilder classes, and we now create the parameter grid builder. For example, we want to select the best maximum depth of the decision tree in the range from 1 to 8, so we add this to the parameter grid builder. And we create the evaluator, because we want to select the model which has the best accuracy among the others; using all these parameters that we have set, we are going to evaluate with the multi-class classification evaluator.
And now, the cross validator will use this evaluator: the cross validator will fit the pipeline to the input data and it will select the best model across the different settings. So, we create a CrossValidator class and pass the pipeline, the parameter grid and the evaluator into this class.
And finally, we select the number of folds, and the number of folds should be not less than five or ten. After the number of folds has been selected, we create the CV model by calling fit, and it takes some time because Spark needs to perform the training and evaluation of the classifier once for each fold and each parameter value.
Now, we can see the average accuracy across the folds for each of the values of the tree depth.
Refer Slide Time: (39:30)
Now, we can see that the first stage of our pipeline was the decision tree, so we can get the best model, and the best model has depth six and 47 nodes in this case.
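A sketch of the cross-validation setup described above, reusing the earlier objects; here the pipeline contains the label indexer followed by the decision tree (so the tree is accessed as the last stage of the best model), which is one possible arrangement rather than necessarily the exact pipeline used in the lecture.

from pyspark.ml import Pipeline
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Search the tree depth from 1 to 8 with k-fold cross-validation.
pipeline = Pipeline(stages=[label_indexer, dt])
param_grid = (ParamGridBuilder()
              .addGrid(dt.maxDepth, list(range(1, 9)))
              .build())
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=param_grid,
                    evaluator=evaluator,
                    numFolds=5)          # the lecture recommends at least five to ten folds
cv_model = cv.fit(df)                    # fit on the raw DataFrame, before label indexing

# Average accuracy across folds for each candidate depth, and the best fitted tree.
print(cv_model.avgMetrics)
best_tree = cv_model.bestModel.stages[-1]
print(best_tree.depth, best_tree.numNodes)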
So, in conclusion: in this lecture we have discussed the concept of random forests and gradient boosted decision trees, and we have also covered a case study using Spark ML programming on decision trees and other tree ensembles. Thank you.