
Big Data Computing

Prof. Rajiv Misra


Computer Science and Engineering, IIT Patna

Lecture - 29
Big Data Predictive Analytics
(Part-II)
Gradient boosted decision trees for regression.

Refer Slide Time :( 0:18)

Boosting: Boosting is a method of combining the outputs of many weak classifiers to produce a powerful
ensemble. There are several variants of boosting algorithms: AdaBoost, BrownBoost, LogitBoost and
Gradient Boosting.

Refer Slide Time :( 0:37)


Now, in big data we see that the size of the training set is very large, and also a large number of
features describe each object. This kind of scenario occurs in a big data situation. In this situation it is
very natural to assume that you would like to train a really complex model, since you have such a great
amount of data, and hopefully this model will be accurate. There are two basic ways in machine learning to
build complex models. The first one is to start with a very complex model from the very beginning and fit
its parameters to the data. This is exactly how neural networks operate: they start with a very complex
model, fit its parameters, and predictions are then made based on those fitted parameters. The second way
is to build a complex model iteratively, where each step requires the training of a simple model. In the
context of boosting, these simple models are called 'Weak Classifiers' or 'Base Classifiers'. The result is
therefore an ensemble of weak classifiers, which builds the model iteratively.

Refer Slide Time :( 2:20)


Let us see how this concept can be applied to regression with gradient boosted decision trees. Let us
assume that we are given a training set Z = {(x_1, y_1), ..., (x_n, y_n)}, where each x_i is the vector of
features and y_i is the label, so there are n samples in the training dataset. The goal is to use this
training set to find a function f(x) such that the squared error on the test set T is minimized:

min_f  Σ_{(x, y) ∈ T} (f(x) − y)^2.

How to build such an f(x) is the question.

Refer Slide Time :( 3:28)

So, how to build such an f(x)? In boosting, our goal is to build the function f(x) iteratively. We suppose
that this function is a sum of M simple functions,

f(x) = Σ_{m=1}^{M} h_m(x).

In particular, let us assume that each function h_m is a decision tree. So here each function is a decision
tree, and their aggregation is the function f(x).

Refer Slide Time :( 4:00)

Let us understand the gradient boosted decision trees for regression in more detail. The algorithm for
gradient boosted trees for the regression problem takes as input the training set Z = {(x_i, y_i), i = 1..n},
where x_i is the vector of features and y_i is the label, and M, the number of iterations. Step number one
calculates the initial model f_0(x): it sums up all the label values and takes their average, that is,

f_0(x) = (1/n) Σ_{i=1}^{n} y_i,

so the mean of the label values from the training dataset becomes the initial function f_0. Then it iterates
for M iterations. At iteration m it first calculates the residuals

ŷ_i = y_i − f_{m−1}(x_i),  i = 1..n,

that is, the differences between the actual labels and the labels predicted by the model of the previous
iteration; these differences are called the 'Residuals'. The next step is to fit a decision tree h_m to the
targets ŷ_i; that means it generates an auxiliary training set {(x_i, ŷ_i)} out of the original set, in which
the labels are replaced by the residuals. Then the model is updated using the regularization coefficient ν:

f_m(x) = f_{m−1}(x) + ν · h_m(x).

In this manner it iterates and computes the whole ensemble. As far as the regularization coefficient is
concerned, the recommended value is ν ≤ 0.1; it is an important parameter in gradient boosted trees for
regression.
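To make the loop above concrete, here is a minimal sketch in plain Python, assuming scikit-learn's
DecisionTreeRegressor as the weak learner; the function names and parameter values are illustrative and not
taken from the lecture.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gradient_boosted_regression(X, y, M=100, nu=0.1, max_depth=3):
        # Step 1: the initial model f_0 is the mean of the labels.
        f0 = float(np.mean(y))
        prediction = np.full(len(y), f0)
        trees = []
        for m in range(M):
            # Residuals: differences between true labels and current predictions.
            residuals = y - prediction
            # Fit a small decision tree to the residuals (the auxiliary training set).
            h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
            trees.append(h)
            # Update the ensemble with the regularization coefficient nu.
            prediction = prediction + nu * h.predict(X)
        return f0, trees

    def predict_ensemble(f0, trees, X, nu=0.1):
        # f_M(x) = f_0 + nu * (h_1(x) + ... + h_M(x))
        return f0 + nu * sum(h.predict(X) for h in trees)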

Refer Slide Time :( 7:37)


So here you might have noticed that gradient boosting is somewhat similar to gradient descent in
optimization theory. If we want to minimize a function using gradient descent, we make a small step in the
direction opposite to the gradient. The gradient of a function, by definition, is a vector which points in
the direction of fastest increase. Since we want to minimize the function, we must move in the direction
opposite to the gradient, and to ensure convergence we must make very small steps, so we multiply each
gradient by a small constant, which is called the 'Step Size'. This is very similar to what we do in
gradient boosting. Gradient boosting can therefore be considered minimization in the space of functions:

f_M(x) = f_0(x) + ν·h_1(x) + ν·h_2(x) + … + ν·h_M(x).

So, boosting is minimization in the functional space.

Refer Slide Time :( 8:38)


So, let us see the summary. Boosting is a method for combining the outputs of many weak classifiers or
regressors to produce a powerful ensemble. Gradient boosting is a gradient-descent minimization of the
target function in the functional space. Gradient boosting with decision trees is considered to be among the
best algorithms for general-purpose classification or regression problems. Now, let us see gradient boosted
decision trees for the classification problem.

Refer Slide Time :( 9:13)

Let us assume that we are given a training set Z = {(x_1, y_1), ..., (x_n, y_n)}, where x_i is the vector of
features and y_i is the class label. Since this is a classification problem, let us assume the class labels
are 0 and 1, so these are the labels assigned in the training dataset. The goal here is to find f(x), using
the training dataset, which minimizes the number of misclassifications on the test set T:

Σ_{(x, y) ∈ T} [f(x) ≠ y].

That means whenever the predicted label f(x) does not match the target label y it counts as an error, and
the sum of all such misclassifications is to be minimized. The f(x) which achieves this can then be applied
to the test dataset T = {(x_1, y_1), ..., (x_n, y_n)}. How to build this f(x) is the problem of gradient
boosted decision trees.

Refer Slide Time :( 11:12)


So, gradient boosted trees for classification. Here we have to see how we are going to build such a
function f(x). We use a probabilistic approach and model the probability of an object belonging to the
first class as

P(y = 1 | x) = 1 / (1 + exp(−Σ_m h_m(x))),

where each h_m(x) is a decision tree. Therefore the value satisfies 0 < P(y = 1 | x) < 1. So we model the
probability of an object belonging to the first class, and inside the exponential there is a sum of the
h_m's, each of which is a decision tree. We can easily check that this expression is always between 0 and
1, so it is a normal, regular probability.

Refer Slide Time :( 12:12)


This particular function is called the 'Sigmoid Function', which maps all real values into the range
between 0 and 1. The sigmoid function is

σ(x) = 1 / (1 + e^{−x}).

Refer Slide Time :( 12:29)

So, here also the same equation appears, 1 / (1 + exp(−f(x))); this is called the sigmoid function. Let us
denote f(x) = Σ_{m=1}^{M} h_m(x); it is an ensemble of decision trees. Then you can write the probability
of belonging to the first class in a simple way using f(x). The main idea which is used here is called the
'Principle of Maximum Likelihood'. So, what is this principle of maximum likelihood? The likelihood is the
probability of observing some data given the statistical model. If we have a dataset with objects from 1 to
n, then the probability of observing such a dataset is the multiplication of the probabilities of all
single objects, and this multiplication is called the 'Likelihood'. Here it can be expressed as

p(y_1, ..., y_n | x_1, ..., x_n) = ∏_{i=1}^{n} p(y_i | x_i),

that is, the multiplication of the probabilities for all single objects.

Refer Slide Time :( 13:53)

And now we have to work with the likelihood function. The principle of maximum likelihood says: find the
function f(x) which maximizes the likelihood, which is equivalent to finding the f(x) maximizing the
logarithm of the likelihood, since the logarithm is a monotone function. That can be represented as

Q(f) = Σ_{i=1}^{n} log p(y_i | x_i),

and we have to find max_f Q(f), that is, the f which maximizes the likelihood. So, we have to find the f(x)
which maximizes the likelihood.
Refer Slide Time :( 14:45)

Hence, we have to fit our distribution to this dataset by the principle of maximum likelihood. We denote by
Q(f) the logarithm of the likelihood; it is the sum of the logarithms of the probabilities, and we are going
to maximize this function. Now we will use a shorthand for this logarithm,

L(y_i, f(x_i)) = log p(y_i | x_i),

and here we emphasize that this logarithm depends on the true label y_i and on our predicted value f(x_i).
Then

Q(f) = Σ_{i=1}^{n} L(y_i, f(x_i)),

where y_i is the true label of the i-th object in the dataset and f(x_i) is the predicted value. The
summation of these logarithms is the log-likelihood Q(f), which has to be maximized.

Refer Slide Time :( 16:42)


Let us see gradient boosted trees for classification, putting together everything we have discussed. The
algorithm takes as input the training set Z = {(x_i, y_i), i = 1..n}, where x_i is the vector of features
and y_i is the categorical label, and M, the number of iterations. The initial model is set from p_1, the
proportion of objects of the first class:

f_0(x) = log( p_1 / (1 − p_1) ).

Then, for m = 1 to M, we first compute the gradients

g_i = ∂L(y_i, f(x_i)) / ∂f(x_i),  evaluated at f = f_{m−1},  i = 1..n.

These are called the 'Gradients'. After the gradients are calculated, we fit a decision tree h_m to the
targets g_i; that is, we build an auxiliary training set {(x_i, g_i)} in which the labels are replaced by
the gradients, and fit the decision tree h_m(x) to this auxiliary dataset. Then the step size is chosen as

ρ_m = argmax_ρ Q[ f_{m−1}(x) + ρ·h_m(x) ],

and the model is updated as

f_m(x) = f_{m−1}(x) + ν·ρ_m·h_m(x),

where ν is the regularization coefficient. This process in turn gives the ensemble, and classification
proceeds in this manner.
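As a rough illustration of the loop above, here is a minimal Python sketch for the binary case, again
assuming scikit-learn's DecisionTreeRegressor as the weak learner and using the fact that for the logistic
log-likelihood the gradient g_i equals y_i − P(y = 1 | x_i); the fixed step ρ_m = 1 is a simplification of
the line search described in the lecture.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def sigmoid(f):
        return 1.0 / (1.0 + np.exp(-f))

    def gradient_boosted_classification(X, y, M=100, nu=0.1, max_depth=3):
        # f_0(x) = log(p1 / (1 - p1)), where p1 is the share of class-1 objects.
        p1 = float(np.mean(y))
        f0 = np.log(p1 / (1.0 - p1))
        f = np.full(len(y), f0)
        trees = []
        for m in range(M):
            # Gradient of the log-likelihood w.r.t. f(x_i): g_i = y_i - P(y=1|x_i).
            g = y - sigmoid(f)
            # Fit a regression tree to the gradients (the auxiliary training set).
            h = DecisionTreeRegressor(max_depth=max_depth).fit(X, g)
            trees.append(h)
            # Update the ensemble; rho_m is taken as 1 here for simplicity.
            f = f + nu * h.predict(X)
        return f0, trees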
Refer Slide Time :( 19:49)

Now, let us see stochastic boosting. If we use stochastic boosting with gradient boosted trees for
classification, then the set which is used as the training set at each iteration, of size k = 0.5n, is
created by random sampling with replacement; in this part we are using the concept from random forests of
bootstrap generation of the dataset. So, this is gradient boosted trees + stochastic boosting, shown here.

Refer Slide Time :( 20:32)

In this way, sampling with replacement is used.


Refer Slide Time :( 20:37)
So, here we see that the size of the sample is reduced to k = 0.5n; that is, the auxiliary training set is
reduced to half the size and is created by random sampling with replacement. This is the 'Concept of
Bagging', which is used here in stochastic boosting.
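As a tiny illustration of this subsampling step, here is a sketch assuming NumPy; the helper name and the
seed are illustrative.

    import numpy as np

    def bootstrap_half(X, y, seed=0):
        # Draw k = 0.5 * n indices with replacement (bagging-style subsample).
        rng = np.random.default_rng(seed)
        n = len(y)
        idx = rng.integers(0, n, size=n // 2)
        return X[idx], y[idx]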

Refer Slide Time :( 21:07)

So, sampling with replacement is used, but here the value of k becomes half of the total number of samples,
so only half of the examples are taken into consideration at each iteration.

Refer Slide Time :( 21:26)


Hence, as we reduce the sample size, training becomes more efficient in this case.

Refer Slide Time :( 21:33)

So, let us see the tips for usage. First of all, it is important to understand how the regularization
parameter works. In this figure we can see the behavior of gradient boosted decision trees with two values
of the parameter, 0.1 and 0.5. At the initial stage of learning, the variant with 0.5 appears better,
because it has the lower testing error.

Refer Slide Time :( 22:21)


But later on, as at each iteration you measure the testing error, for example on a held-out dataset, the
variant with the lower regularization coefficient, 0.1, eventually reaches the lower testing error, and
finally this variant turns out to be superior. This is very typical behavior, so you should not stop your
algorithm after several dozen iterations; you should proceed until it converges. Convergence happens when
your testing error does not change much anymore. The variant with the lower regularization coefficient
converges more slowly, but eventually it builds a better model.

Refer Slide Time :( 23:04)

So, the recommended learning rate should be less than or equal to 0.1, as we have already seen. The bigger
your dataset is, the larger the number of iterations should be; the recommended number of iterations ranges
from several hundred to several thousand. Also, the more features you have in your dataset, the deeper your
decision trees should be. These are very general rules, because the bigger your dataset is and the more
features you have, the more complex a model you can build without overfitting in this scenario.

Refer Slide Time :( 23:39)

So, to summarize: it is among the best methods for general-purpose classification and regression problems.
It automatically handles interactions of features, because at its core it is based on decision trees, which
can combine several features in a single tree. This algorithm is also computationally scalable: it can be
effectively executed in a distributed environment, for example in Spark, so it can be executed on top of a
Spark cluster.

Refer Slide Time :( 24:09)


This algorithm also has very good predictive power. But unfortunately the model is not interpretable. We
have seen this problem with random forests as well, but here the predictive power is better than that of
the forest. The final ensemble is very large and cannot be analyzed by human experts, so obviously it does
not have a good interpretation, due to the complex nature of this model. There is always a trade-off in
machine learning between predictive power and interpretability, because the more complex and accurate your
model is, the harder the analysis of this model is for humans.

Refer Slide Time :( 24:53)

Spark ML based decision trees and examples: we now have to see the programming aspect of it, that is, how
using Spark ML we can build decision trees and tree ensembles.

Refer Slide Time :( 25:10)


So here, we will talk about Spark ML for doing classification and regression with ensembles of trees.

Refer Slide Time :( 25:19)

First we have to see how a decision tree is implemented using Spark ML, and then we can extend it to random
forests and gradient boosted trees. So, the first step here is to create the Spark context and the Spark
session, which is shown in these steps: we have to create the Spark context and we also have to create the
Spark session.
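A minimal sketch of this step, assuming PySpark; the application name and master URL are illustrative and
not taken from the lecture.

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SparkSession

    # Create the Spark context.
    conf = SparkConf().setAppName("decision-trees-demo").setMaster("local[*]")
    sc = SparkContext.getOrCreate(conf)

    # Create the Spark session, which is needed for building DataFrames.
    spark = SparkSession.builder.getOrCreate()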

Refer Slide Time :( 25:59)


The Spark session is required for building the data frame. Then we can download the dataset, and we can
look at the dataset after downloading.

Refer Slide Time :( 26:10)

We load the dataset into a data frame, and then we can see the features: the data has an ID number and a
categorical label, which can be denoted by 0 and 1, or, as you can see here, by M and B, which will finally
be converted into 1 and 0. So this dataset has the features, the labels and an ID; these are the three
important parts. Let us take this case, where all these important parts are there in the dataset, and now
we can look further into the dataset.
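A sketch of the loading step, assuming PySpark and a local CSV file; the file name and the exact column
layout are assumptions, since the lecture does not show them.

    # Read the downloaded CSV into a DataFrame; header and schema inference are assumed.
    df = spark.read.csv("breast_cancer.csv", header=True, inferSchema=True)

    # Inspect the columns: an id, the categorical label (M/B) and the numeric features.
    df.printSchema()
    df.show(5)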

Refer Slide Time :( 27:05)


You can see that these features are the result of the analysis, and there are around 569 different examples
in the dataset.

Refer Slide Time :( 27:24)

Now we need to transform this label, which is given as M or B, to a numeric value. Using a StringIndexer
object we can do this; the StringIndexer object will convert it into label values.
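A minimal sketch of this step with PySpark's StringIndexer; the input and output column names ('diagnosis',
'labelIndexed') are assumptions for illustration.

    from pyspark.ml.feature import StringIndexer

    # Convert the categorical label (M/B) into a numeric label column.
    indexer = StringIndexer(inputCol="diagnosis", outputCol="labelIndexed")
    df_indexed = indexer.fit(df).transform(df)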

Refer Slide Time :( 27:48)


The labels are converted into numeric values, and then we can create the data frame, which is stored in a
distributed manner on the cluster. This data frame will be created with the label and the features, so
these two parts will be used in the data frame, which will be the input to our system. So, the
StringIndexer will then transform the labels,
Refer Slide Time :( 28:11)

so that all the values are converted; all these labels are converted into numeric values using the
StringIndexer.

Refer Slide Time :( 28:18)


So now, after the label indexing, all the label values are converted from letters to numeric values; the
label value is now either 1 or 0. We now have two important things, the features and the label index, in
the data frame shown here, and this data frame will now be used for further analysis.
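The features column mentioned above is typically assembled from the numeric columns with a VectorAssembler;
this is a sketch under that assumption, since the lecture does not show the exact call, and the column
names are illustrative.

    from pyspark.ml.feature import VectorAssembler

    # Combine the numeric feature columns into a single 'features' vector column.
    feature_cols = [c for c in df_indexed.columns
                    if c not in ("id", "diagnosis", "labelIndexed")]
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    data = assembler.transform(df_indexed).select("features", "labelIndexed")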
Refer Slide Time :( 28:55)

Now we will make the training/test split in the proportion 70 to 30, where 70% is the training portion and
30% is the test portion of the dataset. First we use the 70% of the dataset to train the model, a single
decision tree, and then we will use the 30% of the dataset for testing purposes. This is done using
randomSplit with proportions 70/30, and then we train the decision tree model using DecisionTreeClassifier:
we specify the indexed label column, and then we fit the decision tree on the training dataset. This builds
a model, the 'Decision Tree Model', which is represented by the dtModel variable. Now you can see how many
nodes this model has, what its depth is, which features are important, and so on.
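A sketch of the split and training step in PySpark; the seed and variable names are illustrative, and the
column names follow the earlier sketches.

    from pyspark.ml.classification import DecisionTreeClassifier

    # Split the data 70/30 into training and test portions.
    train, test = data.randomSplit([0.7, 0.3], seed=42)

    # Train a single decision tree on the training portion.
    dt = DecisionTreeClassifier(labelCol="labelIndexed", featuresCol="features")
    dt_model = dt.fit(train)

    # Inspect the trained tree: number of nodes, depth, feature importances.
    print(dt_model.numNodes, dt_model.depth)
    print(dt_model.featureImportances)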

Refer Slide Time :( 30:14)

So, after importing DecisionTreeClassifier we create the object that is responsible for the training, then
we call the 'fit' method on the training data and obtain the decision tree model that we have shown.

Refer Slide Time :( 30:36)

Now, the training was done very fast, and we have to see the results: the number of nodes and the depth of
the decision tree, the feature importances, the number of features used in the decision tree, and so on.
These different things we can inspect.
Refer Slide Time :( 31:01)

And now we are going to look at the decision tree itself: we can print the decision tree model, and you can
see that it is nothing but if-then-else conditions in all the internal nodes of the decision tree, and
finally the leaves hold the predictions. So, once the decision tree is built,

Refer Slide Time :( 31:23)

now we can use this decision tree on the test data to obtain the predictions. So we take the test data, use
the same decision tree model and perform the predictions. Once the predictions are there, we evaluate them
using a MulticlassClassificationEvaluator; we evaluate these predictions, which are made by this single
decision tree, for their accuracy. So, what do we find here?
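A sketch of the prediction and evaluation step, assuming the objects from the earlier sketches.

    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    # Apply the trained tree to the held-out test data.
    predictions = dt_model.transform(test)

    # Evaluate the accuracy of the predictions.
    evaluator = MulticlassClassificationEvaluator(
        labelCol="labelIndexed", predictionCol="prediction", metricName="accuracy")
    accuracy = evaluator.evaluate(predictions)
    print("Test error = %g" % (1.0 - accuracy))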

Refer Slide Time :( 32:08)


We find that the accuracy is very high and the error is very small: the testing error is only about three
percent. So, with only three percent error, the model is quite accurate.

Refer Slide Time :( 32:28)

And now we are going to see another method, which is called 'Gradient Boosted Decision Trees'. First of all
we do the import and then create the object which will do the classification. Here we specify the label
column as the label index, as we have seen in the previous example, and the features column as the
features. Actually, it is not mandatory to specify the features column; we can do it either with this
assignment or without this argument. After having done this, we fit this gradient boosted classifier, using
the label index and the features that we have seen, on the training dataset, and we get the model, the
gradient boosted decision tree model. Once we obtain the model, we apply the test data to it, and using a
MulticlassClassificationEvaluator we evaluate its predictions.
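A sketch of this step with PySpark's GBTClassifier; the maxIter and stepSize values match the 100 iterations
and default step 0.1 mentioned just below, and the remaining names follow the earlier sketches.

    from pyspark.ml.classification import GBTClassifier

    # Train a gradient boosted tree ensemble on the same training data.
    gbt = GBTClassifier(labelCol="labelIndexed", featuresCol="features",
                        maxIter=100, stepSize=0.1)
    gbt_model = gbt.fit(train)

    # Predict on the test set and evaluate accuracy with the same evaluator.
    gbt_predictions = gbt_model.transform(test)
    print("Test error = %g" % (1.0 - evaluator.evaluate(gbt_predictions)))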

Refer Slide Time :( 34:09)

So, let us see: we are going to do 100 iterations and the default step size is 0.1. After that, we see that
the test error here is quite small, so basically this has improved over the decision tree model.

Refer Slide Time :( 34:33)

Now we are going to see the random forest. For the random forest we take the same dataset again, but here
we change the classifier to RandomForestClassifier; we fit the training data with the random forest
classifier and get a model, the 'Random Forest Model'. Using this model we transform the test data, that
is, we apply the test data to this model and get the predictions. After getting the predictions, we are now
evaluating them using a MulticlassClassificationEvaluator, which will evaluate the prediction accuracy.
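A sketch of the random forest variant under the same assumptions; the number of trees is illustrative.

    from pyspark.ml.classification import RandomForestClassifier

    # Train a random forest on the same training data.
    rf = RandomForestClassifier(labelCol="labelIndexed", featuresCol="features",
                                numTrees=100)
    rf_model = rf.fit(train)

    # Predict and evaluate exactly as before.
    rf_predictions = rf_model.transform(test)
    print("Test error = %g" % (1.0 - evaluator.evaluate(rf_predictions)))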

Refer Slide Time :( 35:28)

And here we are able to see that the test error is smaller than with the previous decision trees. Hence, in
this example the testing accuracy of the random forest was the best; with other data the situation may be
quite different. In the general case, the bigger your dataset is and the more features it has, the better
the quality of complex algorithms like gradient boosting or random forests will be.

Refer Slide Time :( 36:02)


Now, let us see cross-validation using Spark ML. Cross-validation helps to assess the quality of a machine
learning model and to find the best model among a family of models. First of all, we have to create the
Spark context, which is shown here, and then we create the Spark session.

Refer Slide Time :( 36:27)

Then we build the Spark session, which will create the data frames, and we read the dataset: we load the
data file and create the data frame DF as the input data frame.

Refer Slide Time :( 36:50)


And now, after doing the various transformations, we have the input data, which we can see is already
indexed, and we have to do the cross-validation. So, what steps are required for doing cross-validation
with this dataset? For example, we want to select the best parameters of a single decision tree. We create
the decision tree object and then we should create the pipeline; here we are going to create a pipeline
whose only stage is the decision tree, as shown in the sketch below.
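A sketch of the pipeline setup, under the same column-name assumptions as before.

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import DecisionTreeClassifier

    # A single-stage pipeline containing only the decision tree classifier.
    dt = DecisionTreeClassifier(labelCol="labelIndexed", featuresCol="features")
    pipeline = Pipeline(stages=[dt])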

Refer Slide Time :( 37:25)

So, then we will set up this cross-validation using a parameter grid builder, and we also have to create
the cross-validator.
Refer Slide Time :( 37:39)

So, then we import the CrossValidator and ParamGridBuilder classes, and we now create the parameter grid
with the ParamGridBuilder class. For example, we want to select the best maximum depth of the decision tree
in the range from 1 to 8, so we add that grid to the ParamGridBuilder. We also create the evaluator, since
we want to select the model which has the best accuracy among the others; using all these parameters we are
going to evaluate with a MulticlassClassificationEvaluator.
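A sketch of the parameter grid and the evaluator, assuming the pipeline sketched above.

    from pyspark.ml.tuning import ParamGridBuilder
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    # Search over maxDepth values 1..8 for the decision tree stage.
    param_grid = (ParamGridBuilder()
                  .addGrid(dt.maxDepth, list(range(1, 9)))
                  .build())

    # Select the model with the best accuracy.
    evaluator = MulticlassClassificationEvaluator(
        labelCol="labelIndexed", predictionCol="prediction", metricName="accuracy")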

Refer Slide Time :( 38:22)

And now the evaluator will be used by the cross-validator: the cross-validator will be fit to the input
data and it will select the best model among the different candidates. So, we create the evaluator because
we want to select the model which has the best accuracy among the others, and we create a CrossValidator
class and pass the pipeline, the parameter grid and the evaluator into this class.

Refer Slide Time :( 38:55)

And finally, we select the number of folds; the number of folds should not be less than five or ten. After
the number of folds has been selected, we put everything together as sketched below.
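A sketch of the cross-validator, assuming ten folds (consistent with the "ten times" mentioned below) and
the objects defined in the earlier sketches.

    from pyspark.ml.tuning import CrossValidator

    # Cross-validate the pipeline over the parameter grid.
    cv = CrossValidator(estimator=pipeline,
                        estimatorParamMaps=param_grid,
                        evaluator=evaluator,
                        numFolds=10)

    # Fit on the assembled, indexed DataFrame; this trains and evaluates once per fold.
    cv_model = cv.fit(data)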

Refer Slide Time :( 39:07)

we create the CV model; it takes some time because Spark needs to run the training and the evaluation once
per fold, that is, ten times here.

Refer Slide Time :( 39:21)

Now, we can see the average accuracy across the folds for each value of the tree depth.
Refer Slide Time :( 39:30)

Now, since the first stage of our pipeline was the decision tree, we can get the best model, and the best
model has depth six and 47 nodes in this case.
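A small sketch of how the per-depth scores and the best model can be retrieved, under the same assumptions.

    # Average accuracy across folds for each candidate maxDepth.
    print(cv_model.avgMetrics)

    # The pipeline's only stage is the decision tree, so take it from the best model.
    best_tree = cv_model.bestModel.stages[0]
    print(best_tree.depth, best_tree.numNodes)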

Refer Slide Time :( 39:42)


Then we can use this model to make predictions on any other dataset. In the ParamGridBuilder we can use
several parameters, for example the maximum depth and some other parameter of the decision tree, such as
the minimum number of instances per node, and select some other grid here. But in this example we did not
do it, for simplicity, and if we evaluate only one parameter the training is much faster.

Refer Slide Time :( 40:09)

So, in conclusion: in this lecture we have discussed the concepts of random forests and gradient boosted
decision trees, and we have also covered a case study using Spark ML programming on decision trees and
other tree ensembles. Thank you.
