[MRG+3] Add mean absolute error splitting criterion to DecisionTreeRegressor #6667
y_vals = NULL
weights = NULL
y_vals = <double*> calloc(self.n_node_samples, sizeof(double))
Not sure if this is the correct way to instantiate some sort of collection object in Cython...
So I wrote an initial version of node_value. Also, I sprinkled the diff with some comments and general questions I had about Cython; it'd be great if someone could answer them.
I fixed the

Hi, sorry for the noise but for the purpose of
Ah I see, the code you linked goes ahead and sorts the array before applying a similar algorithm to the one I have. I was wondering if there was a way to calculate the weighted median without sorting, but it seems reasonable that calculating the median inherently involves sorting (as you're partitioning into two halves, one of which is less than the other). Thanks for the input @maniteja123!
I implemented a generic quicksort algorithm to sort the array of y values and their weights, and corrected a bug that caused the median to be incorrectly calculated when you have to take the average of two bounding points. My next steps will be to verify the correctness of the MAE implementation in
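For reference, here is a minimal Python sketch of the sort-then-scan approach described above (plain Python with hypothetical names, not the Cython code in this PR; the midpoint-averaging case mentioned above, where the cumulative weight lands exactly on half, is omitted for brevity):

```python
def weighted_median(values, weights):
    """Return the weighted median: the first value at which the
    cumulative weight reaches half of the total weight."""
    pairs = sorted(zip(values, weights))   # sort values, carrying weights along
    total = sum(w for _, w in pairs)
    cumulative = 0.0
    for v, w in pairs:
        cumulative += w
        if cumulative >= total / 2.0:
            return v

print(weighted_median([3.0, 1.0, 2.0], [1.0, 1.0, 1.0]))  # 2.0
print(weighted_median([3.0, 1.0, 2.0], [1.0, 5.0, 1.0]))  # 1.0: the heavy weight pulls the median
```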
So I ran a few tests where I used toy data and calculated the MAE by hand, and compared my results to the code's results. I'll write unit tests in a bit, but I think
I just wrote an initial version of
median_dest[k] = y_vals[median_index]

cdef void sort_values_and_weights(self, double* y_vals, double* weights,
Could you use DOUBLE_t everywhere?
Sure, what are the advantages of using DOUBLE_t versus double?
double is the standard C double. Its size is almost always 64 bits, but that is not guaranteed, as it depends on the compiler and/or platform. Here DOUBLE_t is basically ctypedef cnp.npy_float64 DOUBLE_t, where we fix the size to 64 bits to prevent unwanted side effects.
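A quick Python illustration of the distinction (the first size is defined by the platform's C compiler, the second is fixed by numpy):

```python
import ctypes
import numpy as np

# ctypes.c_double mirrors the platform C compiler's double: usually
# 8 bytes, but the C standard does not guarantee it.
print(ctypes.sizeof(ctypes.c_double))

# np.float64 (what DOUBLE_t is typedef'd to) is 64 bits by definition,
# on every platform.
print(np.dtype(np.float64).itemsize)  # always 8
```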
ah ok, that makes sense. Thanks for your explanation! Why is double used vs DOUBLE_t in other parts of the file, then? Should this be changed?
Hmm... Leave those as such. Whenever you make a change, use DOUBLE_t.
Ah wait! Use DOUBLE_t for y and weights, as they will be taken from a numpy array. For anything that won't be taken from / converted back into a numpy array, you are free to use double. That is what is done. Sorry if my previous comment did not clarify that.
Is there any reason not to just use DOUBLE_t for everything?
I think DOUBLE_t comes with an overhead.
I've changed double / DOUBLE_t as appropriate; can you let me know if I missed anything?

A few comments -
@rvraghav93 ok, I'll change that. It'll probably require a slight reworking due to the inability to use class variables of the criterion from utils, but that should be trivial, if a bit verbose.
It should be something like how splitter.pyx uses the sort helper. For now, just put it inside criterion.pyx; let's move it to utils later. I'll comment more specifically by this weekend.
Sorry for not working on this for a bit; I went through and tinkered with the code. I moved the sorting functions to _utils.pyx. In the meanwhile, I'd appreciate it if someone could look over my
for p in range(start, pos):
    i = samples[p]
    y_ik = y[i * self.y_stride + k]
    # impurity_left[0] += (fabs(y_ik - medians[k]) / (pos - start))
I'm not sure why this works, but it does; if I do it with the statements that are commented out, there's a float rounding error.
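A small pure-Python illustration of the kind of discrepancy involved (illustrative only, not the tree code): dividing each term inside the loop and dividing the accumulated sum once are not guaranteed to round identically.

```python
import random

random.seed(0)
values = [random.random() for _ in range(10000)]
n = len(values)

# Variant 1: accumulate the raw terms, divide once at the end.
total = 0.0
for v in values:
    total += v
mean_once = total / n

# Variant 2: divide each term inside the loop (like the commented-out line).
mean_each = 0.0
for v in values:
    mean_each += v / n

# The two results can differ in the last few bits.
print(mean_once == mean_each)
print(abs(mean_once - mean_each))
```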
I am not familiar enough with the tree code to know if it's possible to do a proxy impurity function for MAE. For the segfault on 32-bit Python under Windows as reported by AppVeyor: if you don't have Windows and are too lazy to install it in a VM (you will also need VS 2015 to build scikit-learn with Python 3.5), then you can try to reproduce the issue with Anaconda 32-bit for Linux (assuming you use Linux to develop on scikit-learn).
@ogrisel thanks for the tip! I installed dependencies through pip on a fresh 32-bit Ubuntu VM, and it worked perfectly. Does Anaconda do something different, or are they functionally equivalent?
return ((self.weighted_n_node_samples / self.weighted_n_samples) *
        (impurity - (self.weighted_n_right /
Could you avoid modifying lines that are not related to the PR, please? It works in your favor by speeding up the review, as the diff is smaller...
Yohoo you get a +1!! Amazing job!

This looks solid and well documented/tested, with some minor improvements to the API of gradient boosting as well. Let's get this GSoC PR merged! I give my +1 as well.
Merging this! 🍻

🍻

Yeay! Congratulations

Thanks all! 🎆

🍻
@@ -1296,6 +1297,14 @@ class GradientBoostingClassifier(BaseGradientBoosting, ClassifierMixin):
    of the input variables.
    Ignored if ``max_leaf_nodes`` is not None.

criterion : string, optional (default="friedman_mse")
Sorry for being late to the party, but this should have a versionadded, right?
Yes it should, I'll add that.
thanks :)
Add mean absolute error splitting criterion to DecisionTreeRegressor (scikit-learn#6667)

* feature: add initial node_value method
* testing code for node_impurity and node_value (this code runs into 'Bus Error: 10' at node_value final assignment)
* fix: node_value now correctly calculating weighted median for sorted data. Still need to change the code to work with unsorted data.
* fix: node_value now correctly calculates median regardless of initial order
* fix: correct bug in calculating median when taking midpoint is necessary
* feature: add initial version of children_impurity
* feature: refactor median calculation into one function
* fix: fix use of DOUBLE_t vs double
* feature: move helper functions to _utils.pyx, fix mismatched pointer type
* fix: fix some bugs in children_impurity method
* push a debug version to try to solve segfault
* push latest changes, segfault probably happening bc of something in _utils.pyx
* fix: fix segfault in median calculation and remove excessive logging
* chore: revert some misc spacing changes I accidentally made
* chore: one last spacing fix in _splitter.pyx
* feature: don't calculate weighted median if no weights are passed in
* remove extraneous logging statement
* fix: fix children impurity calculation
* fix: fix bug with children impurity not being initally set to 0
* fix: hacky fix for a float accuracy error
* fix: incorrect type cast in median array generation for node_impurity
* slightly tweak node_impurity function
* fix: be more explicit with casts
* feature: revert cosmetic changes and free temporary arrays
* fix: only free weight array in median calcuation if it was created
* style: remove extraneous newline / trigger CI build
* style: remove extraneous 0 from range
* feature: save sorts within a node to speed it up
* fix: move parts of dealloc to regression criterion
* chore: add comment to splitter to try to force recythonizing
* chore: add comment to _tree.pyx to try to force recythonizing
* chore: add empty comment to gradient boosting to force recythonizing
* fix: fix bug in weighted median
* try moving sorted values to a class variable
* feature: refactor criterion to sort once initially, then draw all samples from this sorted data
* style: remove extraneous parens from if condition
* implement median-heap method for calculating impurity
* style: remove extra line
* style: fix inadvertent cosmetic changes; i'll address some of these in a separate PR
* feature: change minmaxheap to internally use sorted arrays
* refactored MAE and push to share work
* fix errors wrt median insertion case
* spurious comment to force recythonization
* general code cleanup
* fix typo in _tree.pyx
* removed some extraneous comments
* [ci skip] remove earlier microchanges
* [ci skip] remove change to priorityheap
* [ci skip] fix indentation
* [ci skip] fix class-specific issues with heaps
* [ci skip] restore a newline
* [ci skip] remove microchange to refactor later
* reword a comment
* remove heapify methods from queue class
* doc: update docstrings for dt, rf, and et regressors
* doc: revert incorrect spacing to shorten diff
* convert get_median to return value directly
* [ci skip] remove accidental whitespace
* remove extraneous unpacking of values
* style: misc changes to identifiers
* add docstrings and more informative variable identifiers
* [ci skip] add trivial comments to recythonize
* remove trivial comments for recythonizing
* force recythonization for real this time
* remove trivial comments for recythonization
* rfc: harmonize arg. names and remove unnecessary checks
* convert allocations to safe_realloc
* fix bug in weighted case and add tests for MAE
* change all medians to DOUBLE_t
* add loginc allocate mediancalculators once, and reset otherwise
* misc style fixes
* modify cinit of regressioncriterion to take n_samples
* add MAE formula and force rebuild bc. travis was down
* add criterion parameter to gradient boosting and add forest tests
* add entries to what's new
Hey guys, I hope this is the right place for my concern. Since I need the MAPE (mean absolute percentage error) and am willing to implement it, I need some clarification on the implementation of the MAE criterion.

Am I correct that the MAE is implemented using the median, and therefore should be called median_absolute_error instead of mean_absolute_error? There are also two different classes for mean_abs... and median_abs... in the sklearn.metrics module. Also, I think this class calculates 1/n * (\sum_i |y_i - median(Y)|); shouldn't it be median(|y_i - median(Y)|)?

Perhaps I should then focus on a new class based on RegressionCriterion for implementing MAPE ((1/n) * (\sum_i (|y_i - f_i| / |f_i|)))? Or is this MAE just implemented with f_i = median(Y)? Am I correct that for the MSE, f_i equals mean(Y) in 1/n * (\sum_i (y_i - f_i)^2)? I hope I didn't misunderstand something here.
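For reference, the formulas under discussion, written out consistently (restating the question and the thread's answers; f_i denotes the constant prediction within a node):

```latex
% Node impurity computed by this PR's MAE criterion (f_i = median):
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \operatorname{median}(y) \right|

% For comparison, the MSE criterion uses the mean as the constant prediction:
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \operatorname{mean}(y) \right)^2
```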
I have the same question as m0uH; I don't understand why the median comes into the equation when calculating "mean absolute error". I thought it should be implemented much the same as MSE, but replacing the power of 2 with an absolute value...? Would really appreciate it if somebody could explain why the median is used.
The MAE of an array is minimized by the median of the array. This means that if you have an array and want to choose a single value that minimizes the MAE between that value and the array, it is always the median. In contrast, the MSE of an array is minimized by the mean of the array.
Great explanation. Can we add it to the docs?
Thank you jmschrei, this is difficult for me to understand, but I think I now get it with your help. Please could you confirm the following is correct: if I get the mean absolute error and the median absolute error of the following array: we can't use the mean-based absolute error because it will always be 0, i.e., splits would all end up with a mean absolute error of 0, and the absolute error across the parent would be 0 too, thus we wouldn't end up with any splits! Is this correct?
Consider the following:
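The code snippet from this comment was not preserved in this extract; the following is a sketch of the kind of comparison it showed (the names mae and mse follow the later reference to "your mae function"):

```python
import numpy as np

def mae(y, pred):
    # mean absolute error of a constant prediction `pred`
    return np.mean(np.abs(y - pred))

def mse(y, pred):
    # mean squared error of a constant prediction `pred`
    return np.mean((y - pred) ** 2)

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

print(mae(y, np.mean(y)))    # 31.2   -> MAE using the mean (22.0)
print(mae(y, np.median(y)))  # 20.2   -> MAE using the median (3.0): smaller
print(mse(y, np.mean(y)))    # 1522.0 -> MSE using the mean: smaller
print(mse(y, np.median(y)))  # 1883.0 -> MSE using the median
```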
I am not sure how you got a mean absolute error of 0. In this example you can see that the mean gives a smaller MSE than the median, but that the median gives a smaller MAE than the mean. I hope that helps.
My apologies, in my rush I had forgotten the most important step: taking the absolute value!... :-s Thank you for your example. I have tried numerous example arrays through your mae function, and it seems that the median does in fact always appear to produce the lower error... I assume, therefore, that the error will always be smaller for the absolute error when using the median rather than the mean? I am probably being thick (like earlier, haha), but if my above assumption is true, why does this make the median a better candidate than the mean when calculating absolute error? Hopefully I have got it right this time by saying... it is because a smaller error means it has fit the data better!? (I am wondering about the intricacies because calculating a "mean"-based absolute error would be so much faster to run - but pointless if far inferior?... and/or whether a "_criterion.pyx -> proxy_impurity_improvement" could be loosely based around this?)
Not @jmschrei, but yes, this is correct (well, technically they could be equal). Hence why we use it :)

Indeed, the goal of the tree is to grow in a way such that it minimizes the criterion (in this case, MAE) on the train set.
Yes, calculating MAE with the "mean" does not actually minimize the MAE. Since our goal is to minimize the MAE, we use the median rather than the mean. In other words, calculating a "mean"-based mean absolute error would be much faster, but would not guarantee good results, because the mean does not minimize the criterion.
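A quick brute-force check of this claim (an illustrative sketch, not the tree code): scan a grid of constant predictions and confirm that the MAE-minimizing one sits at the median.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Evaluate the MAE of every constant prediction on a coarse grid.
candidates = np.arange(0.0, 101.0, 0.5)
maes = np.abs(y[None, :] - candidates[:, None]).mean(axis=1)

print(candidates[np.argmin(maes)])  # 3.0, i.e. the median
print(np.median(y))                 # 3.0
```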
Thank you both for your explanations and time with this. I am very appreciative 👍. I apologise for slightly hijacking this thread, but hope that it helps others who come across this with the same questions. It seems the lack of standard terminology in this domain is where I was mostly tripped up (and others looking through the various related issue threads here). I was incorrectly assuming that "Mean" Absolute Error meant that the value subtracted from X, before taking the absolute value, was the mean... I thus thought there was such a thing as a "Median Absolute Error"; however, "Mean Absolute Error" actually covers mean, mode and median (or any other measure of central tendency). The "Mean" in MAE actually refers to the fact that we divide by N, irrespective of the measure of central tendency used.

It is perhaps confusing because the most widely used metric, MSE, predominantly uses the mean as the central tendency; this is where the incorrect assumption can arise (well... in my case, at least). So... it just so happens that the MAE used here (scikit-learn DecisionTrees) is the "median" central-tendency variant! The other point of confusion is the usual usage of MSE/MAE in learning algorithms; these are usually applied to: For future reference for myself/others, I found the following links, which cover most of my confusion with regards to MAE, variance and decision trees: https://en.wikipedia.org/wiki/Average_absolute_deviation [please correct me if any of the above is wrong :)]
Adding the mean absolute error criterion to tree.DecisionTreeRegressor

Sorry for the long silence; the past few weeks have been busy with the start of a new quarter. Things have settled down, though, and I'm ready to resume contributions.

I spent the past few days reading through and trying to get a handle on the tree module, and I've begun looking into implementing the mean absolute error (MAE) split criterion in the DecisionTreeRegressor. I'm creating this WIP PR to provide a public ground for discussion about the code; I believe that "feedback early, fail fast" will help maximize the amount of learning I can gain from this PR to apply toward future contributions.

Here's a task list of sub-objectives (that I see) to complete:

- node_value method to calculate the median (I wrote an initial version of node_value in my initial commit; please let me know if I'm on the right track / if there are things I should fix or can improve in functional correctness, efficiency, and style)
- node_impurity to return the mean absolute error
- children_impurity method

I've never used C / C++ before, so I've been learning and experimenting with C and Cython as well. If you see a segment of code that looks incorrect, please point it out! I'm looking forward to learning more about Cython and C through this PR.

Thanks!