Pprett/gradient boosting by glouppe · Pull Request #6 · pprett/scikit-learn

Merged: 7 commits merged into pprett:gradient_boosting on Mar 20, 2012

Conversation

@glouppe commented Mar 19, 2012

This is my first bunch of commits regarding your PR.

I really like how you managed to remove the "terminal" mechanisms from the Tree code :)

My changes are the following:

  • Moved _compute_feature_importances into Tree
  • Moved _build_tree into Tree
  • Used DTYPE instead of float64
  • Cosmetic fixes (cosmits) and PEP8

Most of those do not actually concern the boosting module. I still have to review the gradient_boosting.py file in more depth (later today or tomorrow).

@pprett (Owner) commented Mar 19, 2012

@glouppe some of the tests fail due to numerical issues (an aftermath of the dtype change). I fixed those, but I noticed a performance regression for the following benchmark::

import numpy as np
from sklearn import datasets
from sklearn.ensemble import gradient_boosting

X, y = datasets.make_hastie_10_2(n_samples=12000, random_state=1)
X = X.astype(np.float32)

gbrt = gradient_boosting.GradientBoostingClassifier(
    n_estimators=250, min_samples_split=5, max_depth=1,
    learn_rate=1.0, random_state=0)
%timeit gbrt.fit(X, y)

it goes from::

1 loops, best of 3: 1.32 s per loop

to::

1 loops, best of 3: 1.97 s per loop

@pprett (Owner) commented Mar 19, 2012

hmm... I think I hunted it down::

379       250       768998   3076.0     42.2              residual = loss.negative_gradient(y, y_pred, k=k)

This is 4 times the usual timing, due to y and y_pred having different dtypes.
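
(For reference, a sketch of how per-line timings like the one above can be collected with the line_profiler package; the class and method names are assumptions, and the boosting loop may live in a different method on this branch)::

import numpy as np
from line_profiler import LineProfiler
from sklearn import datasets
from sklearn.ensemble import gradient_boosting

X, y = datasets.make_hastie_10_2(n_samples=12000, random_state=1)
X = X.astype(np.float32)

gbrt = gradient_boosting.GradientBoostingClassifier(
    n_estimators=250, min_samples_split=5, max_depth=1,
    learn_rate=1.0, random_state=0)

profiler = LineProfiler()
# profile the method that contains the boosting loop
profiler.add_function(gradient_boosting.BaseGradientBoosting.fit)
profiler.runcall(gbrt.fit, X, y)
profiler.print_stats()  # prints Hits / Time / Per Hit / % Time for each line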

Review comment on the diff::

The error of the (best) split.
For leaves `init_error == best_error`.

-init_error : np.ndarray of float64
+init_error : np.ndarray of DTYPE

@pprett (Owner) commented:
Why should init_error or best_error have type DTYPE, which is the dtype of the data array? Either use np.float32 or np.float64. I tend to use np.float64 whenever possible (i.e. when memory consumption is not an issue).
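
(A minimal sketch of the suggestion, with illustrative names rather than the actual Tree internals: the error arrays are allocated as float64 even though the data array X stays in DTYPE/float32)::

import numpy as np

DTYPE = np.float32                       # dtype of the data array X
X = np.zeros((1000, 10), dtype=DTYPE)    # memory-sensitive: keep as DTYPE

n_nodes = 31
init_error = np.zeros(n_nodes, dtype=np.float64)  # precision-sensitive: float64
best_error = np.zeros(n_nodes, dtype=np.float64)  # independent of X's dtype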

@pprett (Owner) commented Mar 19, 2012

Wow... it seems that 32bit floating point arithmetic in numpy is substantially slower than 64bit arithmetic::

%timeit bd.negative_gradient(y, y_pred)
1000 loops, best of 3: 546 us per loop

vs 32bit::

%timeit bd.negative_gradient(y_float32, y_pred_float32)
100 loops, best of 3: 3.01 ms per loop

It seems that np.exp is the one to blame.
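
(A quick way to reproduce the observation; timings are machine-dependent, and the 12000-sample size simply mirrors the benchmark above)::

import numpy as np
from timeit import timeit

rng = np.random.RandomState(0)
pred64 = rng.randn(12000)             # float64
pred32 = pred64.astype(np.float32)    # same values, float32

t64 = timeit(lambda: np.exp(pred64), number=1000)
t32 = timeit(lambda: np.exp(pred32), number=1000)
print("float64: %.4fs  float32: %.4fs  ratio: %.1fx" % (t64, t32, t32 / t64))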

@glouppe (Author) commented Mar 19, 2012

Wow, that's huge. I was not aware of this. Actually, my machine is 32-bit, which is why I like having the possibility of not using float64. I will have a deeper look at it tomorrow. I'll revert my changes if I come to no good solution.

@pprett (Owner) commented Mar 19, 2012

It might be slower on 64bit machines, but a 6-fold increase is too large. numpy has an npy_expf function that operates on float32, but I don't know whether it is exposed through the numpy API... I'll keep you posted.


@pprett (Owner) commented Mar 19, 2012

Gilles, I just checked the other (regression) models in sklearn; it seems that only tree and ensemble use 32bit floating point for the target values. SVM and Lasso/ElasticNet/SGDRegressor explicitly convert to 64bit. I'd rather use 64bit for tree and ensemble too - this has the advantage that results are more stable (I remember we use np.mean(y) somewhere in our code, which might pose an underflow problem). AFAIK we chose 32bit because of memory consumption, which is only an issue for X, not for y.
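
(A sketch of the convention being described; _prepare_targets is a hypothetical helper, not sklearn API: cast y to float64 at fit time while X keeps whatever dtype the tree code expects for memory reasons)::

import numpy as np

def _prepare_targets(y):
    # y is only n_samples long, so the extra memory of float64 is negligible,
    # and statistics such as np.mean(y) are computed with full precision.
    return np.ascontiguousarray(y, dtype=np.float64)

y32 = np.asarray([0.0, 1.0, 1.0, 0.0], dtype=np.float32)
y = _prepare_targets(y32)
print(y.dtype)  # float64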

@glouppe (Author) commented Mar 19, 2012

Okay, I agree. I'll revert my changes tomorrow.


@glouppe pushed a revert commit ("This reverts commit 3509e16.") with conflicts in:

	sklearn/ensemble/gradient_boosting.py
	sklearn/tree/tree.py

@glouppe (Author) commented Mar 20, 2012

I just pushed the revert commit.

@pprett merged commit cc2bab9 into pprett:gradient_boosting on Mar 20, 2012
@pprett (Owner) commented Mar 20, 2012

@glouppe thanks - I updated whats_new.rst and merged

Commits referencing this pull request:

  • Jul 25, 2013: nitpick fixes, pep8 and fix math equations
  • Mar 18, 2014: Revised text classification chapter