TreeRegressors with MSE Criterion do not correctly handle missing-values #28316

@adam2392

Description

Describe the bug

I found this bug when analyzing the PR/issue from #28295 and working on #27966.

Essentially, this bug only appears in RegressionCriterion, because it additionally tracks the sum of the squared y values in the variable sq_sum_total. That quantity is not handled correctly when missing values are sent to the left node, even though sum_left and sum_right are handled correctly.

The source of this issue lies in the children_impurity function.
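To make the mechanism concrete, here is a plain-NumPy sketch (hypothetical values, not the actual Cython criterion) of how the impurity is derived from the running sums, and how dropping the missing samples' squares from sq_sum_left corrupts it:

```python
import numpy as np

# MSE impurity is computed from running sums via the identity
#   impurity = E[y^2] - (E[y])^2 = sq_sum / n - (sum / n)^2
y_non_missing_left = np.array([1.0, 3.0])  # samples reaching the left child via the split
y_missing = np.array([0.0, 2.0])           # missing-value samples also routed left

y_left = np.concatenate([y_non_missing_left, y_missing])
n = len(y_left)
sum_left = y_left.sum()  # sum_left is tracked correctly in both cases

# Correct: the squares of the missing samples are included.
sq_sum_left = (y_left ** 2).sum()
impurity_correct = sq_sum_left / n - (sum_left / n) ** 2

# Buggy: sq_sum_left only accumulates samples in range(start, pos),
# mirroring the existing loop in children_impurity.
sq_sum_buggy = (y_non_missing_left ** 2).sum()
impurity_buggy = sq_sum_buggy / n - (sum_left / n) ** 2

assert np.isclose(impurity_correct, y_left.var())
assert not np.isclose(impurity_buggy, impurity_correct)
```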

The following fix should be valid:

        for p in range(start, pos):
            i = sample_indices[p]

            if sample_weight is not None:
                w = sample_weight[i]

            for k in range(self.n_outputs):
                y_ik = self.y[i, k]
                sq_sum_left += w * y_ik * y_ik
     
        # Added lines: also accumulate the squares of the missing-value
        # samples, which are sent to the left child when missing_go_to_left is set
        if self.missing_go_to_left:
            for p in range(end_non_missing, self.end):
                i = sample_indices[p]
                if sample_weight is not None:
                    w = sample_weight[i]

                for k in range(self.n_outputs):
                    y_ik = self.y[i, k]
                    sq_sum_left += w * y_ik * y_ik
        
        # continue with the rest of the function
        sq_sum_right = self.sq_sum_total - sq_sum_left
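Because sq_sum_right is derived by subtraction from sq_sum_total, the omitted squares do not merely make the left impurity wrong; they are shifted wholesale into the right child's statistic. A minimal NumPy check (hypothetical values) of that propagation:

```python
import numpy as np

# sq_sum_right = sq_sum_total - sq_sum_left, so any squares dropped from
# sq_sum_left are wrongly attributed to the right child.
y_left_non_missing = np.array([1.0, 3.0])
y_missing_left = np.array([0.0, 2.0])  # routed to the left child
y_right = np.array([4.0, 5.0])

sq_sum_total = (np.concatenate([y_left_non_missing, y_missing_left, y_right]) ** 2).sum()

sq_sum_left_buggy = (y_left_non_missing ** 2).sum()  # missing squares dropped
sq_sum_left_fixed = sq_sum_left_buggy + (y_missing_left ** 2).sum()

sq_sum_right_buggy = sq_sum_total - sq_sum_left_buggy
sq_sum_right_fixed = sq_sum_total - sq_sum_left_fixed

# The right child's statistic is inflated by exactly the dropped squares.
assert sq_sum_right_buggy - sq_sum_right_fixed == (y_missing_left ** 2).sum()
```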

However, this issue runs deeper: it affects not only MSE, but any regression criterion that tracks additional statistics (beyond sum_left and sum_right) to compute its impurity.

I am happy to submit a PR with some relevant tests. This is quite an interesting issue because it doesn't seem to "affect" the performance of DecisionTreeRegressor in unit tests, even though intuitively I would've expected it to. It's possible that we might even see a performance increase for MSE. Note that the fix will also help with PR #27966, because I surmise it will improve the performance of ExtraTrees.

Steps/Code to Reproduce

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# With this dataset, the missing values will always be sent to the left child
# at the first split. The leaf will be pure.
X = np.array([np.nan, 2, np.nan, 4, 5, 6]).reshape(-1, 1)
# X = np.array([np.nan, 2, 3, 4, 5, 6]).reshape(-1, 1)
y = np.arange(6)

tree = DecisionTreeRegressor(random_state=2, max_depth=1).fit(X, y)

from sklearn.tree import plot_tree
dtree = plot_tree(tree)

Expected Results

(plot_tree image)

Actual Results

(plot_tree image)

Versions

1.5.dev0
