[MRG+3] Add mean absolute error splitting criterion to DecisionTreeRegressor · olologin/scikit-learn@da560cc · GitHub

Commit da560cc

nelson-liu authored and olologin committed
[MRG+3] Add mean absolute error splitting criterion to DecisionTreeRegressor (scikit-learn#6667)
* feature: add initial node_value method
* testing code for node_impurity and node_value; this code runs into 'Bus Error: 10' at node_value final assignment
* fix: node_value now correctly calculating weighted median for sorted data. Still need to change the code to work with unsorted data.
* fix: node_value now correctly calculates median regardless of initial order
* fix: correct bug in calculating median when taking midpoint is necessary
* feature: add initial version of children_impurity
* feature: refactor median calculation into one function
* fix: fix use of DOUBLE_t vs double
* feature: move helper functions to _utils.pyx, fix mismatched pointer type
* fix: fix some bugs in children_impurity method
* push a debug version to try to solve segfault
* push latest changes, segfault probably happening bc of something in _utils.pyx
* fix: fix segfault in median calculation and remove excessive logging
* chore: revert some misc spacing changes I accidentally made
* chore: one last spacing fix in _splitter.pyx
* feature: don't calculate weighted median if no weights are passed in
* remove extraneous logging statement
* fix: fix children impurity calculation
* fix: fix bug with children impurity not being initially set to 0
* fix: hacky fix for a float accuracy error
* fix: incorrect type cast in median array generation for node_impurity
* slightly tweak node_impurity function
* fix: be more explicit with casts
* feature: revert cosmetic changes and free temporary arrays
* fix: only free weight array in median calculation if it was created
* style: remove extraneous newline / trigger CI build
* style: remove extraneous 0 from range
* feature: save sorts within a node to speed it up
* fix: move parts of dealloc to regression criterion
* chore: add comment to splitter to try to force recythonizing
* chore: add comment to _tree.pyx to try to force recythonizing
* chore: add empty comment to gradient boosting to force recythonizing
* fix: fix bug in weighted median
* try moving sorted values to a class variable
* feature: refactor criterion to sort once initially, then draw all samples from this sorted data
* style: remove extraneous parens from if condition
* implement median-heap method for calculating impurity
* style: remove extra line
* style: fix inadvertent cosmetic changes; I'll address some of these in a separate PR
* feature: change minmaxheap to internally use sorted arrays
* refactored MAE and push to share work
* fix errors wrt median insertion case
* spurious comment to force recythonization
* general code cleanup
* fix typo in _tree.pyx
* removed some extraneous comments
* [ci skip] remove earlier microchanges
* [ci skip] remove change to priorityheap
* [ci skip] fix indentation
* [ci skip] fix class-specific issues with heaps
* [ci skip] restore a newline
* [ci skip] remove microchange to refactor later
* reword a comment
* remove heapify methods from queue class
* doc: update docstrings for dt, rf, and et regressors
* doc: revert incorrect spacing to shorten diff
* convert get_median to return value directly
* [ci skip] remove accidental whitespace
* remove extraneous unpacking of values
* style: misc changes to identifiers
* add docstrings and more informative variable identifiers
* [ci skip] add trivial comments to recythonize
* remove trivial comments for recythonizing
* force recythonization for real this time
* remove trivial comments for recythonization
* rfc: harmonize arg. names and remove unnecessary checks
* convert allocations to safe_realloc
* fix bug in weighted case and add tests for MAE
* change all medians to DOUBLE_t
* add logic to allocate MedianCalculators once, and reset otherwise
* misc style fixes
* modify cinit of RegressionCriterion to take n_samples
* add MAE formula and force rebuild bc travis was down
* add criterion parameter to gradient boosting and add forest tests
* add entries to what's new
1 parent 635be02 commit da560cc
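Most of the churn in the history above comes from computing medians efficiently: under the MAE criterion, a node's optimal prediction is the weighted median of its targets, and its impurity is the weighted mean absolute deviation from that median. A minimal NumPy sketch of those two quantities follows; it is illustrative only. The commit implements them in Cython with a median-heap, and also takes the midpoint between the two middle values where needed, which this sketch skips by returning the lower weighted median.

import numpy as np

def weighted_median(values, weights):
    # Sort by value, then take the first value at which the running
    # weight reaches half of the total weight (lower weighted median).
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cumulative = np.cumsum(weights)
    return values[np.searchsorted(cumulative, 0.5 * cumulative[-1])]

def mae_impurity(y, sample_weight):
    # Node impurity under MAE: weighted mean absolute deviation
    # of the targets from their weighted median.
    y = np.asarray(y, dtype=float)
    w = np.asarray(sample_weight, dtype=float)
    median = weighted_median(y, w)
    return np.sum(w * np.abs(y - median)) / np.sum(w)

print(weighted_median([3.0, 1.0, 10.0], [1.0, 1.0, 1.0]))  # -> 3.0
print(mae_impurity([3.0, 1.0, 10.0], [1.0, 1.0, 1.0]))     # -> 3.0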

File tree

9 files changed: +794 −20 lines changed


doc/whats_new.rst

Lines changed: 15 additions & 0 deletions
@@ -117,6 +117,14 @@ New features
      and Harabaz score to evaluate the resulting clustering of a set of points.
      By `Arnaud Fouchet`_ and `Thierry Guillemot`_.
 
+   - Added a new splitting criterion for :class:`tree.DecisionTreeRegressor`,
+     the mean absolute error. This criterion can also be used in
+     :class:`ensemble.ExtraTreesRegressor`,
+     :class:`ensemble.RandomForestRegressor`, and the gradient boosting
+     estimators. (`#6667
+     <https://github.com/scikit-learn/scikit-learn/pull/6667>`_) by `Nelson
+     Liu`_.
+
 Enhancements
 ............
 
@@ -142,6 +150,11 @@ Enhancements
      provided as a percentage of the training samples. By
      `yelite`_ and `Arnaud Joly`_.
 
+   - Gradient boosting estimators accept the parameter ``criterion`` to specify
+     the splitting criterion used in built decision trees. (`#6667
+     <https://github.com/scikit-learn/scikit-learn/pull/6667>`_) by `Nelson
+     Liu`_.
+
 - Codebase does not contain C/C++ cython generated files: they are
   generated during build. Distribution packages will still contain generated
   C/C++ files. By `Arthur Mensch`_.
 
@@ -4286,3 +4299,5 @@ David Huard, Dave Morrill, Ed Schofield, Travis Oliphant, Pearu Peterson.
 .. _Sebastian Säger: https://github.com/ssaeger
 
 .. _YenChen Lin: https://github.com/yenchenlin
+
+.. _Nelson Liu: https://github.com/nelson-liu
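After this change, the new criterion is selected by its string name through the public API described in the changelog entry above. A quick sketch on synthetic data (the data and identifiers here are illustrative, not from the commit):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X, y = rng.rand(100, 4), rng.rand(100)

# "mae" grows the tree by minimizing mean absolute error;
# leaf predictions become medians of the training targets they hold.
reg = DecisionTreeRegressor(criterion="mae", random_state=0)
reg.fit(X, y)
print(reg.predict(X[:5]))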

sklearn/ensemble/forest.py

Lines changed: 8 additions & 4 deletions
@@ -948,8 +948,10 @@ class RandomForestRegressor(ForestRegressor):
         The number of trees in the forest.
 
     criterion : string, optional (default="mse")
-        The function to measure the quality of a split. The only supported
-        criterion is "mse" for the mean squared error.
+        The function to measure the quality of a split. Supported criteria
+        are "mse" for the mean squared error, which is equal to variance
+        reduction as feature selection criterion, and "mae" for the mean
+        absolute error.
 
     max_features : int, float, string or None, optional (default="auto")
         The number of features to consider when looking for the best split:
@@ -1300,8 +1302,10 @@ class ExtraTreesRegressor(ForestRegressor):
         The number of trees in the forest.
 
     criterion : string, optional (default="mse")
-        The function to measure the quality of a split. The only supported
-        criterion is "mse" for the mean squared error.
+        The function to measure the quality of a split. Supported criteria
+        are "mse" for the mean squared error, which is equal to variance
+        reduction as feature selection criterion, and "mae" for the mean
+        absolute error.
 
     max_features : int, float, string or None, optional (default="auto")
         The number of features to consider when looking for the best split:
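The same string selects the criterion for both forest regressors whose docstrings are updated above. A brief sketch under the same illustrative setup (a small n_estimators is used because MAE splits are markedly slower than MSE splits):

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

rng = np.random.RandomState(0)
X, y = rng.rand(200, 5), rng.rand(200)

for Forest in (RandomForestRegressor, ExtraTreesRegressor):
    est = Forest(n_estimators=10, criterion="mae", random_state=0)
    est.fit(X, y)
    print(Forest.__name__, est.score(X, y))  # R^2 on training data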

sklearn/ensemble/gradient_boosting.py

Lines changed: 24 additions & 7 deletions
@@ -720,15 +720,16 @@ class BaseGradientBoosting(six.with_metaclass(ABCMeta, BaseEnsemble,
     """Abstract base class for Gradient Boosting. """
 
     @abstractmethod
-    def __init__(self, loss, learning_rate, n_estimators, min_samples_split,
-                 min_samples_leaf, min_weight_fraction_leaf,
+    def __init__(self, loss, learning_rate, n_estimators, criterion,
+                 min_samples_split, min_samples_leaf, min_weight_fraction_leaf,
                  max_depth, init, subsample, max_features,
                  random_state, alpha=0.9, verbose=0, max_leaf_nodes=None,
                  warm_start=False, presort='auto'):
 
         self.n_estimators = n_estimators
         self.learning_rate = learning_rate
         self.loss = loss
+        self.criterion = criterion
         self.min_samples_split = min_samples_split
         self.min_samples_leaf = min_samples_leaf
         self.min_weight_fraction_leaf = min_weight_fraction_leaf
@@ -762,7 +763,7 @@ def _fit_stage(self, i, X, y, y_pred, sample_weight, sample_mask,
 
         # induce regression tree on residuals
         tree = DecisionTreeRegressor(
-            criterion='friedman_mse',
+            criterion=self.criterion,
             splitter='best',
             max_depth=self.max_depth,
             min_samples_split=self.min_samples_split,
@@ -1296,6 +1297,14 @@ class GradientBoostingClassifier(BaseGradientBoosting, ClassifierMixin):
         of the input variables.
         Ignored if ``max_leaf_nodes`` is not None.
 
+    criterion : string, optional (default="friedman_mse")
+        The function to measure the quality of a split. Supported criteria
+        are "friedman_mse" for the mean squared error with improvement
+        score by Friedman, "mse" for mean squared error, and "mae" for
+        the mean absolute error. The default value of "friedman_mse" is
+        generally the best as it can provide a better approximation in
+        some cases.
+
     min_samples_split : int, float, optional (default=2)
         The minimum number of samples required to split an internal node:
 
@@ -1426,7 +1435,7 @@ class GradientBoostingClassifier(BaseGradientBoosting, ClassifierMixin):
     _SUPPORTED_LOSS = ('deviance', 'exponential')
 
     def __init__(self, loss='deviance', learning_rate=0.1, n_estimators=100,
-                 subsample=1.0, min_samples_split=2,
+                 subsample=1.0, criterion='friedman_mse', min_samples_split=2,
                  min_samples_leaf=1, min_weight_fraction_leaf=0.,
                  max_depth=3, init=None, random_state=None,
                  max_features=None, verbose=0,
@@ -1435,7 +1444,7 @@ def __init__(self, loss='deviance', learning_rate=0.1, n_estimators=100,
 
         super(GradientBoostingClassifier, self).__init__(
             loss=loss, learning_rate=learning_rate, n_estimators=n_estimators,
-            min_samples_split=min_samples_split,
+            criterion=criterion, min_samples_split=min_samples_split,
             min_samples_leaf=min_samples_leaf,
             min_weight_fraction_leaf=min_weight_fraction_leaf,
             max_depth=max_depth, init=init, subsample=subsample,
@@ -1643,6 +1652,14 @@ class GradientBoostingRegressor(BaseGradientBoosting, RegressorMixin):
         of the input variables.
         Ignored if ``max_leaf_nodes`` is not None.
 
+    criterion : string, optional (default="friedman_mse")
+        The function to measure the quality of a split. Supported criteria
+        are "friedman_mse" for the mean squared error with improvement
+        score by Friedman, "mse" for mean squared error, and "mae" for
+        the mean absolute error. The default value of "friedman_mse" is
+        generally the best as it can provide a better approximation in
+        some cases.
+
     min_samples_split : int, float, optional (default=2)
         The minimum number of samples required to split an internal node:
 
@@ -1772,15 +1789,15 @@ class GradientBoostingRegressor(BaseGradientBoosting, RegressorMixin):
     _SUPPORTED_LOSS = ('ls', 'lad', 'huber', 'quantile')
 
     def __init__(self, loss='ls', learning_rate=0.1, n_estimators=100,
-                 subsample=1.0, min_samples_split=2,
+                 subsample=1.0, criterion='friedman_mse', min_samples_split=2,
                  min_samples_leaf=1, min_weight_fraction_leaf=0.,
                  max_depth=3, init=None, random_state=None,
                  max_features=None, alpha=0.9, verbose=0, max_leaf_nodes=None,
                  warm_start=False, presort='auto'):
 
         super(GradientBoostingRegressor, self).__init__(
             loss=loss, learning_rate=learning_rate, n_estimators=n_estimators,
-            min_samples_split=min_samples_split,
+            criterion=criterion, min_samples_split=min_samples_split,
             min_samples_leaf=min_samples_leaf,
             min_weight_fraction_leaf=min_weight_fraction_leaf,
             max_depth=max_depth, init=init, subsample=subsample,
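This diff threads ``criterion`` from both estimators' ``__init__`` through ``BaseGradientBoosting`` into the ``DecisionTreeRegressor`` built in ``_fit_stage``, replacing the hard-coded ``'friedman_mse'``. Note that ``criterion`` controls how each individual tree splits, while ``loss`` still controls the boosting objective. A usage sketch on illustrative synthetic data:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X, y = rng.rand(200, 5), rng.rand(200)

# loss="ls" remains the boosting objective; criterion="mae" only
# changes the splitting rule inside each fitted regression tree.
gbr = GradientBoostingRegressor(loss="ls", criterion="mae",
                                n_estimators=20, random_state=0)
gbr.fit(X, y)
print(gbr.score(X, y))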

sklearn/ensemble/tests/test_forest.py

Lines changed: 2 additions & 2 deletions
@@ -159,7 +159,7 @@ def check_boston_criterion(name, criterion):
 
 
 def test_boston():
-    for name, criterion in product(FOREST_REGRESSORS, ("mse", )):
+    for name, criterion in product(FOREST_REGRESSORS, ("mse", "mae", "friedman_mse")):
         yield check_boston_criterion, name, criterion
 
 
@@ -244,7 +244,7 @@ def test_importances():
     for name, criterion in product(FOREST_CLASSIFIERS, ["gini", "entropy"]):
         yield check_importances, name, criterion, X, y
 
-    for name, criterion in product(FOREST_REGRESSORS, ["mse", "friedman_mse"]):
+    for name, criterion in product(FOREST_REGRESSORS, ["mse", "friedman_mse", "mae"]):
         yield check_importances, name, criterion, X, y
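These tests are nose-style generators: each ``yield`` hands a check function and its arguments to the test runner. A standalone equivalent of the extended Boston check might look like the sketch below; the estimator dictionary, ``n_estimators``, and the 0.9 threshold are illustrative assumptions, and ``load_boston`` is used because it was still available in scikit-learn at the time of this commit.

from itertools import product
from sklearn.datasets import load_boston
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

boston = load_boston()
X, y = boston.data, boston.target

FOREST_REGRESSORS = {"RandomForestRegressor": RandomForestRegressor,
                     "ExtraTreesRegressor": ExtraTreesRegressor}

# Mirror the test's product() over (estimator, criterion) pairs.
for (name, Cls), criterion in product(FOREST_REGRESSORS.items(),
                                      ("mse", "mae", "friedman_mse")):
    est = Cls(n_estimators=5, criterion=criterion, random_state=1)
    score = est.fit(X, y).score(X, y)
    assert score > 0.9, "low score for %s with criterion=%s" % (name, criterion)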
