TreeRegressors with MSE Criterion do not correctly handle missing-values #28316

@adam2392

Description

Describe the bug

I found this bug when analyzing the PR/issue from #28295 and working on #27966.

Essentially, this bug only appears in RegressionCriterion, because it additionally tracks the sum of the squared y values in the variable sq_sum_total. That quantity is not handled correctly when missing values are sent to the left node, even though sum_left and sum_right are handled correctly.

The source of this issue lies in the children_impurity function.
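To make the mechanism concrete, here is a plain-NumPy sketch (hypothetical values, not the actual Cython criterion) of how the impurity is derived from the running sums, and how dropping the missing samples' squares from sq_sum_left corrupts it:

```python
import numpy as np

# MSE impurity is computed from running sums via the identity
#   impurity = E[y^2] - (E[y])^2 = sq_sum / n - (sum / n)^2
y_non_missing_left = np.array([1.0, 3.0])  # samples reaching the left child via the split
y_missing = np.array([0.0, 2.0])           # missing-value samples also routed left

y_left = np.concatenate([y_non_missing_left, y_missing])
n = len(y_left)
sum_left = y_left.sum()  # sum_left is tracked correctly in both cases

# Correct: the squares of the missing samples are included.
sq_sum_left = (y_left ** 2).sum()
impurity_correct = sq_sum_left / n - (sum_left / n) ** 2

# Buggy: sq_sum_left only accumulates samples in range(start, pos),
# mirroring the existing loop in children_impurity.
sq_sum_buggy = (y_non_missing_left ** 2).sum()
impurity_buggy = sq_sum_buggy / n - (sum_left / n) ** 2

assert np.isclose(impurity_correct, y_left.var())
assert not np.isclose(impurity_buggy, impurity_correct)
```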

The following fix should be valid:

        for p in range(start, pos):
            i = sample_indices[p]

            if sample_weight is not None:
                w = sample_weight[i]

            for k in range(self.n_outputs):
                y_ik = self.y[i, k]
                sq_sum_left += w * y_ik * y_ik
     
        # Added lines: also accumulate the squares of the missing-value
        # samples, which are sent to the left child when missing_go_to_left is set
        if self.missing_go_to_left:
            for p in range(end_non_missing, self.end):
                i = sample_indices[p]
                if sample_weight is not None:
                    w = sample_weight[i]

                for k in range(self.n_outputs):
                    y_ik = self.y[i, k]
                    sq_sum_left += w * y_ik * y_ik
        
        # continue with the rest of the function
        sq_sum_right = self.sq_sum_total - sq_sum_left
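Because sq_sum_right is derived by subtraction from sq_sum_total, the omitted squares do not merely make the left impurity wrong; they are shifted wholesale into the right child's statistic. A minimal NumPy check (hypothetical values) of that propagation:

```python
import numpy as np

# sq_sum_right = sq_sum_total - sq_sum_left, so any squares dropped from
# sq_sum_left are wrongly attributed to the right child.
y_left_non_missing = np.array([1.0, 3.0])
y_missing_left = np.array([0.0, 2.0])  # routed to the left child
y_right = np.array([4.0, 5.0])

sq_sum_total = (np.concatenate([y_left_non_missing, y_missing_left, y_right]) ** 2).sum()

sq_sum_left_buggy = (y_left_non_missing ** 2).sum()  # missing squares dropped
sq_sum_left_fixed = sq_sum_left_buggy + (y_missing_left ** 2).sum()

sq_sum_right_buggy = sq_sum_total - sq_sum_left_buggy
sq_sum_right_fixed = sq_sum_total - sq_sum_left_fixed

# The right child's statistic is inflated by exactly the dropped squares.
assert sq_sum_right_buggy - sq_sum_right_fixed == (y_missing_left ** 2).sum()
```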

However, this issue runs deeper: it affects not only MSE, but any regression criterion that tracks additional statistics (beyond sum_left and sum_right) to compute its impurity.

I am happy to submit a PR with some relevant tests. This is quite an interesting issue because it doesn't seem to "affect" the performance of DecisionTreeRegressor in unit tests, even though intuitively I would've expected it to. It's possible that we might even see a performance increase for MSE. Note that the fix will also help with PR #27966, because I surmise it will improve the performance of ExtraTrees.

Steps/Code to Reproduce

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# With this dataset, the missing values will always be sent to the left child
# at the first split. The leaf will be pure.
X = np.array([np.nan, 2, np.nan, 4, 5, 6]).reshape(-1, 1)
# X = np.array([np.nan, 2, 3, 4, 5, 6]).reshape(-1, 1)
y = np.arange(6)

tree = DecisionTreeRegressor(random_state=2, max_depth=1).fit(X, y)

from sklearn.tree import plot_tree
dtree = plot_tree(tree)

Expected Results

(plot_tree image)

Actual Results

(plot_tree image)

Versions

1.5.dev0
