Describe the bug
I found this bug while analyzing PR/issue #28295 and working on #27966.
Essentially, the bug only occurs in RegressorCriterion because, on top of the sums of the target values, it also tracks the sum of their squares in the variable sq_sum_total. The squared sums are not handled correctly when missing values are sent to the left node, even though sum_left and sum_right are handled correctly.
The source of this issue lies in the children_impurity function.
The following fix should be valid:
for p in range(start, pos):
    i = sample_indices[p]

    if sample_weight is not None:
        w = sample_weight[i]

    for k in range(self.n_outputs):
        y_ik = self.y[i, k]
        sq_sum_left += w * y_ik * y_ik

# Added lines: when missing values go to the left child, their squared
# targets must also be accumulated into sq_sum_left
# (with end_non_missing = self.end - self.n_missing).
if self.missing_go_to_left:
    for p in range(end_non_missing, self.end):
        i = sample_indices[p]

        if sample_weight is not None:
            w = sample_weight[i]

        for k in range(self.n_outputs):
            y_ik = self.y[i, k]
            sq_sum_left += w * y_ik * y_ik

# continue with the rest of the function
sq_sum_right = self.sq_sum_total - sq_sum_left

However, this issue runs quite deep, because it affects not only MSE but any regression criterion that needs additional sufficient statistics (beyond sum_left and sum_right) to compute the impurity.
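To illustrate the effect on the computed impurity, here is a minimal pure-Python sketch (not the Cython implementation; the target values are made up) of the MSE decomposition impurity = sq_sum / n - (sum / n)^2 for a left child that receives two missing-value samples:

import numpy as np

# Made-up targets: the left child receives two non-missing samples with
# targets [2, 4] plus two missing-value samples with targets [0, 2].
y_left_non_missing = np.array([2.0, 4.0])
y_left_missing = np.array([0.0, 2.0])
y_left = np.concatenate([y_left_non_missing, y_left_missing])

n_left = y_left.size
sum_left = y_left.sum()  # sum_left already accounts for the missing samples

# Correct: sq_sum_left accumulated over *all* samples in the left child
sq_sum_left_correct = np.sum(y_left ** 2)
print(sq_sum_left_correct / n_left - (sum_left / n_left) ** 2)  # 2.0, the true variance

# Buggy: sq_sum_left accumulated only over [start, pos), i.e. without the
# missing-value samples, while sum_left and n_left still include them
sq_sum_left_buggy = np.sum(y_left_non_missing ** 2)
print(sq_sum_left_buggy / n_left - (sum_left / n_left) ** 2)  # 1.0, understated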
I am happy to submit a PR with some relevant tests. This is quite an interesting issue because it does not seem to "affect" the performance of DecisionTreeRegressor in the existing unit tests, although intuitively I would have expected it to; it is even possible that we will see a performance improvement for MSE. Note that the fix will also help with PR #27966, because I surmise it will improve the performance of ExtraTrees.
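As a rough sketch of what such a test could look like (the test name and the use of tree.apply to recover each leaf's samples are my own illustrative choices), one could compare the impurities recorded in the fitted tree against the empirical variance of the targets in each child:

import numpy as np

from sklearn.tree import DecisionTreeRegressor


def test_child_impurity_with_missing_values():
    # Missing values are sent to one of the two children at the first split.
    X = np.array([np.nan, 2, np.nan, 4, 5, 6]).reshape(-1, 1)
    y = np.arange(6, dtype=np.float64)

    tree = DecisionTreeRegressor(random_state=2, max_depth=1).fit(X, y)

    # Node 0 is the root; its two children are the leaves of the depth-1 tree.
    children = [tree.tree_.children_left[0], tree.tree_.children_right[0]]

    # Recompute the MSE impurity of each leaf from the samples it receives.
    leaf_ids = tree.apply(X)
    for node in children:
        y_node = y[leaf_ids == node]
        assert np.isclose(tree.tree_.impurity[node], np.var(y_node))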
Steps/Code to Reproduce
import numpy as np

from sklearn.tree import DecisionTreeRegressor

# With this dataset, the missing values will always be sent to the left child
# at the first split. The leaf will be pure.
X = np.array([np.nan, 2, np.nan, 4, 5, 6]).reshape(-1, 1)
# X = np.array([np.nan, 2, 3, 4, 5, 6]).reshape(-1, 1)
y = np.arange(6)
tree = DecisionTreeRegressor(random_state=2, max_depth=1).fit(X, y)
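# One way to observe the bug (added inspection, attribute names as in the
# fitted Tree object): print the recorded node impurities and compare the
# impurity of the child receiving the missing values with the variance of
# its targets.
print(tree.tree_.impurity)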
from sklearn.tree import plot_tree
dtree = plot_tree(tree)
Expected Results
Actual Results
Versions
1.5.dev0
