[MRG+3] Add mean absolute error splitting criterion to DecisionTreeRegressor #6667
y_vals = NULL
weights = NULL
y_vals = <double*> calloc(self.n_node_samples, sizeof(double))
Not sure if this is the correct way to instantiate some sort of collection object in Cython...
So I wrote an initial version of node_value. Also, I sprinkled the diff with some comments and general questions I had about Cython; it'd be great if someone could answer them.
I fixed the

Hi, sorry for the noise but for the purpose of
Ah I see, the code you linked goes ahead and sorts the array before applying a similar algorithm to the one I have. I was wondering if there was a way to calculate the weighted median without sorting, but it seems reasonable that calculating the median inherently involves sorting (as you're partitioning into two halves, one of which is less than the other). Thanks for the input @maniteja123!
I implemented a generic quicksort algorithm to sort the array of y values and their weights, and corrected a bug that caused the median to be incorrectly calculated when you have to take the average of two bounding points. My next steps will be to verify the correctness of the MAE implementation in
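For reference, here is a minimal Python sketch of the sort-then-scan approach described above (plain Python with hypothetical names, not the Cython code in this PR; the midpoint-averaging case mentioned above, where the cumulative weight lands exactly on half, is omitted for brevity):

```python
def weighted_median(values, weights):
    """Return the weighted median: the first value at which the
    cumulative weight reaches half of the total weight."""
    pairs = sorted(zip(values, weights))   # sort values, carrying weights along
    total = sum(w for _, w in pairs)
    cumulative = 0.0
    for v, w in pairs:
        cumulative += w
        if cumulative >= total / 2.0:
            return v

print(weighted_median([3.0, 1.0, 2.0], [1.0, 1.0, 1.0]))  # 2.0
print(weighted_median([3.0, 1.0, 2.0], [1.0, 5.0, 1.0]))  # 1.0: the heavy weight pulls the median
```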
So I ran a few tests where I used toy data and calculated the MAE by hand, and compared my results to the code's results. I'll write unit tests in a bit, but I think
I just wrote an initial version of
median_dest[k] = y_vals[median_index]

cdef void sort_values_and_weights(self, double* y_vals, double* weights,
Could you use DOUBLE_t everywhere?
Sure, what are the advantages of using DOUBLE_t versus double?
double is the standard C double. Its size is almost always 64 bits, but that is not guaranteed, as it depends on the compiler and/or platform. Here DOUBLE_t is basically ctypedef cnp.npy_float64 DOUBLE_t, where we fix the size to 64 bits to prevent unwanted side effects.
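A quick Python illustration of the distinction (the first size is defined by the platform's C compiler, the second is fixed by numpy):

```python
import ctypes
import numpy as np

# ctypes.c_double mirrors the platform C compiler's double: usually
# 8 bytes, but the C standard does not guarantee it.
print(ctypes.sizeof(ctypes.c_double))

# np.float64 (what DOUBLE_t is typedef'd to) is 64 bits by definition,
# on every platform.
print(np.dtype(np.float64).itemsize)  # always 8
```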
ah ok, that makes sense. Thanks for your explanation! Why is double used vs DOUBLE_t in other parts of the file, then? Should this be changed?
Hmm... Leave those as such. Whenever you make a change, use DOUBLE_t.
Ah wait! Use DOUBLE_t for y and weights, as they will be taken from a numpy array. For anything that won't be taken from / converted back into a numpy array, you are free to use double. That is what is done. Sorry if my previous comment did not clarify that.
Is there any reason not to just use DOUBLE_t for everything?
I think DOUBLE_t comes with an overhead.
I've changed double / DOUBLE_t as appropriate; can you let me know if I missed anything?

A few comments -
@rvraghav93 ok, I'll change that. It'll probably require a slight reworking due to the inability to use class variables of the criterion from utils, but that should be trivial, if a bit verbose.
It should be something like how splitter.pyx uses the sort helper. For now, just put it inside criterion.pyx; let's move it to utils later. I'll comment more specifically by this weekend.
Sorry for not working on this for a bit; I went through and tinkered with the code. I moved the sorting functions to _utils.pyx. In the meanwhile, I'd appreciate it if someone could look over my
for p in range(start, pos):
    i = samples[p]
    y_ik = y[i * self.y_stride + k]
    # impurity_left[0] += (fabs(y_ik - medians[k]) / (pos - start))
I'm not sure why this works, but it does; if I do it with the statements that are commented out, there's a float rounding error.
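A small pure-Python illustration of the kind of discrepancy involved (illustrative only, not the tree code): dividing each term inside the loop and dividing the accumulated sum once are not guaranteed to round identically.

```python
import random

random.seed(0)
values = [random.random() for _ in range(10000)]
n = len(values)

# Variant 1: accumulate the raw terms, divide once at the end.
total = 0.0
for v in values:
    total += v
mean_once = total / n

# Variant 2: divide each term inside the loop (like the commented-out line).
mean_each = 0.0
for v in values:
    mean_each += v / n

# The two results can differ in the last few bits.
print(mean_once == mean_each)
print(abs(mean_once - mean_each))
```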
I am not familiar enough with the tree code to know if it's possible to do a proxy impurity function for MAE. For the segfault on 32-bit Python under Windows as reported by AppVeyor: if you don't have Windows and are too lazy to install it in a VM (you will also need VS 2015 to build scikit-learn with Python 3.5), then you can try to reproduce the issue with Anaconda 32-bit for Linux (assuming you use Linux to develop on scikit-learn).
@ogrisel thanks for the tip! I installed dependencies through pip on a fresh 32-bit Ubuntu VM, and it worked perfectly. Does Anaconda do something different, or are they functionally equivalent?
return ((self.weighted_n_node_samples / self.weighted_n_samples) *
        (impurity - (self.weighted_n_right /
Could you avoid modifying lines that are not related to the PR, please? It works in your favor by speeding up the review, as the diff is smaller...
Yohoo you get a +1!! Amazing job!

This looks solid and well documented/tested, with some minor improvements to the API of gradient boosting as well. Let's get this GSoC PR merged! I give my +1 as well.
Merging this! 🍻

🍻

Yeay! Congratulations

Thanks all! 🎆

🍻
@@ -1296,6 +1297,14 @@ class GradientBoostingClassifier(BaseGradientBoosting, ClassifierMixin):
    of the input variables.
    Ignored if ``max_leaf_nodes`` is not None.

criterion : string, optional (default="friedman_mse")
Sorry for being late to the party, but this should have a versionadded, right?
Yes it should, I'll add that.
thanks :)
Add mean absolute error splitting criterion to DecisionTreeRegressor (scikit-learn#6667)

* feature: add initial node_value method
* testing code for node_impurity and node_value (this code runs into 'Bus Error: 10' at node_value final assignment)
* fix: node_value now correctly calculating weighted median for sorted data. Still need to change the code to work with unsorted data.
* fix: node_value now correctly calculates median regardless of initial order
* fix: correct bug in calculating median when taking midpoint is necessary
* feature: add initial version of children_impurity
* feature: refactor median calculation into one function
* fix: fix use of DOUBLE_t vs double
* feature: move helper functions to _utils.pyx, fix mismatched pointer type
* fix: fix some bugs in children_impurity method
* push a debug version to try to solve segfault
* push latest changes, segfault probably happening bc of something in _utils.pyx
* fix: fix segfault in median calculation and remove excessive logging
* chore: revert some misc spacing changes I accidentally made
* chore: one last spacing fix in _splitter.pyx
* feature: don't calculate weighted median if no weights are passed in
* remove extraneous logging statement
* fix: fix children impurity calculation
* fix: fix bug with children impurity not being initally set to 0
* fix: hacky fix for a float accuracy error
* fix: incorrect type cast in median array generation for node_impurity
* slightly tweak node_impurity function
* fix: be more explicit with casts
* feature: revert cosmetic changes and free temporary arrays
* fix: only free weight array in median calcuation if it was created
* style: remove extraneous newline / trigger CI build
* style: remove extraneous 0 from range
* feature: save sorts within a node to speed it up
* fix: move parts of dealloc to regression criterion
* chore: add comment to splitter to try to force recythonizing
* chore: add comment to _tree.pyx to try to force recythonizing
* chore: add empty comment to gradient boosting to force recythonizing
* fix: fix bug in weighted median
* try moving sorted values to a class variable
* feature: refactor criterion to sort once initially, then draw all samples from this sorted data
* style: remove extraneous parens from if condition
* implement median-heap method for calculating impurity
* style: remove extra line
* style: fix inadvertent cosmetic changes; i'll address some of these in a separate PR
* feature: change minmaxheap to internally use sorted arrays
* refactored MAE and push to share work
* fix errors wrt median insertion case
* spurious comment to force recythonization
* general code cleanup
* fix typo in _tree.pyx
* removed some extraneous comments
* [ci skip] remove earlier microchanges
* [ci skip] remove change to priorityheap
* [ci skip] fix indentation
* [ci skip] fix class-specific issues with heaps
* [ci skip] restore a newline
* [ci skip] remove microchange to refactor later
* reword a comment
* remove heapify methods from queue class
* doc: update docstrings for dt, rf, and et regressors
* doc: revert incorrect spacing to shorten diff
* convert get_median to return value directly
* [ci skip] remove accidental whitespace
* remove extraneous unpacking of values
* style: misc changes to identifiers
* add docstrings and more informative variable identifiers
* [ci skip] add trivial comments to recythonize
* remove trivial comments for recythonizing
* force recythonization for real this time
* remove trivial comments for recythonization
* rfc: harmonize arg. names and remove unnecessary checks
* convert allocations to safe_realloc
* fix bug in weighted case and add tests for MAE
* change all medians to DOUBLE_t
* add loginc allocate mediancalculators once, and reset otherwise
* misc style fixes
* modify cinit of regressioncriterion to take n_samples
* add MAE formula and force rebuild bc. travis was down
* add criterion parameter to gradient boosting and add forest tests
* add entries to what's new
Hey guys, I hope this is the right place for my concern. Since I need the MAPE (mean absolute percentage error) and am willing to implement it, I need some clarification on the implementation of the MAE criterion.

Am I correct that the MAE is implemented using the median, and therefore should be called median_absolute_error instead of mean_absolute_error? There are also two different classes for mean_abs... and median_abs... in the sklearn.metrics module. Also, I think this class calculates 1/n * (\sum_i |y_i - median(Y)|); shouldn't it be median(|y_i - median(Y)|)?

Perhaps I should then focus on a new class based on RegressionCriterion for implementing MAPE ((1/n) * (\sum_i (|y_i - f_i| / |f_i|)))? Or is this MAE just implemented with f_i = median(Y)? Am I correct that for the MSE, f_i equals mean(Y) in 1/n * (\sum_i (y_i - f_i)^2)? I hope I didn't misunderstand something here.
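For reference, the formulas under discussion, written out consistently (restating the question and the thread's answers; f_i denotes the constant prediction within a node):

```latex
% Node impurity computed by this PR's MAE criterion (f_i = median):
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \operatorname{median}(y) \right|

% For comparison, the MSE criterion uses the mean as the constant prediction:
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \operatorname{mean}(y) \right)^2
```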
I have the same question as m0uH; I don't understand why the median comes into the equation when calculating "mean absolute error". I thought it should be implemented much the same as MSE, but replacing the power of 2 with an absolute value...? Would really appreciate it if somebody could explain why the median is used.
The MAE of an array is minimized by the median of the array. This means that if you have an array and want to choose a single value that minimizes the MAE between that value and the array, it is always the median. In contrast, the MSE of an array is minimized by the mean of the array.
Great explanation. Can we add it to the docs?
Thank you jmschrei, this is difficult for me to understand, but I think I now get it with your help. Please could you confirm the following is correct: if I get the mean absolute error and the median absolute error of the following array: we can't use the mean-based absolute error because it will always be 0, i.e., splits would all end up with a mean absolute error of 0, and the absolute error across the parent would be 0 too, thus we wouldn't end up with any splits! Is this correct?
Consider the following:
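The code snippet from this comment was not preserved in this extract; the following is a sketch of the kind of comparison it showed (the names mae and mse follow the later reference to "your mae function"):

```python
import numpy as np

def mae(y, pred):
    # mean absolute error of a constant prediction `pred`
    return np.mean(np.abs(y - pred))

def mse(y, pred):
    # mean squared error of a constant prediction `pred`
    return np.mean((y - pred) ** 2)

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

print(mae(y, np.mean(y)))    # 31.2   -> MAE using the mean (22.0)
print(mae(y, np.median(y)))  # 20.2   -> MAE using the median (3.0): smaller
print(mse(y, np.mean(y)))    # 1522.0 -> MSE using the mean: smaller
print(mse(y, np.median(y)))  # 1883.0 -> MSE using the median
```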
I am not sure how you got a mean absolute error of 0. In this example you can see that the mean gives a smaller MSE than the median, but that the median gives a smaller MAE than the mean. I hope that helps.
My apologies, in my rush I had forgotten the most important step: taking the absolute value!... :-s Thank you for your example. I have tried numerous example arrays through your mae function, and it seems that the median does in fact always appear to produce the lower error... I assume, therefore, that the error will always be smaller for the absolute error when using the median rather than the mean? I am probably being thick (like earlier, haha), but if my above assumption is true, why does this make the median a better candidate than the mean when calculating absolute error? Hopefully I have got it right this time by saying... it is because a smaller error means it has fit the data better!? (I am wondering about the intricacies because calculating a "mean"-based absolute error would be so much faster to run - but pointless if far inferior?... and/or whether a "_criterion.pyx -> proxy_impurity_improvement" could be loosely based around this?)
Not @jmschrei, but yes, this is correct (well, technically they could be equal). Hence why we use it :)

Indeed, the goal of the tree is to grow in a way such that it minimizes the criterion (in this case, MAE) on the train set.
Yes, calculating MAE with the "mean" does not actually minimize the MAE. Since our goal is to minimize the MAE, we use the median rather than the mean. In other words, calculating a "mean"-based mean absolute error would be much faster, but would not guarantee good results, because the mean does not minimize the criterion.
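A quick brute-force check of this claim (an illustrative sketch, not the tree code): scan a grid of constant predictions and confirm that the MAE-minimizing one sits at the median.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Evaluate the MAE of every constant prediction on a coarse grid.
candidates = np.arange(0.0, 101.0, 0.5)
maes = np.abs(y[None, :] - candidates[:, None]).mean(axis=1)

print(candidates[np.argmin(maes)])  # 3.0, i.e. the median
print(np.median(y))                 # 3.0
```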
Thank you both for your explanations and time with this. I am very appreciative 👍. I apologise for slightly hijacking this thread, but hope that it helps others who come across this with the same questions. It seems the lack of standard terminology in this domain is where I was mostly tripped up (and others looking through the various related issue threads here). I was incorrectly assuming that "Mean" Absolute Error meant that the value subtracted from X, before taking the absolute value, was the mean... I thus thought there was such a thing as a "Median Absolute Error"; however, "Mean Absolute Error" actually covers mean, mode and median (or any other measure of central tendency). The "Mean" in MAE actually refers to the fact that we divide by N, irrespective of the measure of central tendency used.

It is perhaps confusing because the most widely used metric, MSE, predominantly uses the mean as the central tendency; this is where the incorrect assumption can arise (well... in my case, at least). So... it just so happens that the MAE used here (scikit-learn DecisionTrees) is the "median" central-tendency variant! The other point of confusion is the usual usage of MSE/MAE in learning algorithms; these are usually applied to: For future reference for myself/others, I found the following links, which cover most of my confusion with regards to MAE, variance and decision trees: https://en.wikipedia.org/wiki/Average_absolute_deviation [please correct me if any of the above is wrong :)]
Adding the mean absolute error criterion to tree.DecisionTreeRegressor

Sorry for the long silence; the past few weeks have been busy with the start of a new quarter. Things have settled down, though, and I'm ready to resume contributions.

I spent the past few days reading through and trying to get a handle on the tree module, and I've begun looking into implementing the mean absolute error (MAE) split criterion in the DecisionTreeRegressor. I'm creating this WIP PR to provide a public ground for discussion about the code; I believe that "feedback early, fail fast" will help maximize the amount of learning I can gain from this PR to apply toward future contributions.

Here's a task list of sub-objectives (that I see) to complete:

- node_value method to calculate the median (I wrote an initial version of node_value in my initial commit; please let me know if I'm on the right track / if there are things I should fix or can improve in functional correctness, efficiency, and style)
- node_impurity to return the mean absolute error
- children_impurity method

I've never used C / C++ before, so I've been learning and experimenting with C and Cython as well. If you see a segment of code that looks incorrect, please point it out! I'm looking forward to learning more about Cython and C through this PR.

Thanks!