FIX handle properly missing value in MSE and Friedman-MSE children_impurity #28327

Merged

merged 20 commits on Feb 12, 2024

Conversation

adam2392
Member

Reference Issues/PRs

Closes: #28316

What does this implement/fix? Explain your changes.

  • Missing values wrongly contribute to sq_sum_left/sq_sum_right in MSE-type criteria. This PR makes those running sums account for missing values correctly.
  • Augments the existing unit test to cover Friedman-MSE and MSE (note: the added unit test fails on main).
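As a hedged illustration of the bug class in plain NumPy (not the actual Cython code in `_criterion.pyx`): MSE-style criteria derive the right child's squared sum by subtraction, so if the missing samples' squared targets leak into the left accumulator, the derived right-child impurity can even come out negative, which is impossible for a true MSE.

```python
import numpy as np

# Toy node: two samples go left, two go right, and two samples with a
# missing feature value are routed to the right child.
y_left = np.array([0.0, 1.0])
y_right = np.array([2.0, 3.0])
y_missing = np.array([4.0, 5.0])  # missing-value samples, sent right

sq_sum_total = np.sum(np.concatenate([y_left, y_right, y_missing]) ** 2)

# Buggy accumulation: missing samples leak into the left squared sum,
# so the subtracted right squared sum is far too small.
sq_sum_left_buggy = np.sum(y_left**2) + np.sum(y_missing**2)
sq_sum_right_buggy = sq_sum_total - sq_sum_left_buggy

n_right = len(y_right) + len(y_missing)
mean_right = np.mean(np.concatenate([y_right, y_missing]))
impurity_right_buggy = sq_sum_right_buggy / n_right - mean_right**2

# Correct accumulation: each sample is counted once, on the side it is sent to.
sq_sum_left = np.sum(y_left**2)
sq_sum_right = sq_sum_total - sq_sum_left
impurity_right = sq_sum_right / n_right - mean_right**2

print(impurity_right_buggy)  # -9.0: an impossible (negative) MSE
print(impurity_right)        # 1.25: equals np.var([2, 3, 4, 5])
```

The numbers are illustrative only; the actual fix lives in the Cython criterion code.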

Any other comments?

github-actions bot commented Jan 31, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: a23e8aa.

@glemaitre glemaitre changed the title [BUG FIX] Enable MSE and Friedman-MSE compute children_impurity correctly FIX handle properly missing value in MSE and Friedman-MSE children_impurity Jan 31, 2024
doc/whats_new/v1.5.rst (outdated; resolved)
@glemaitre glemaitre self-requested a review January 31, 2024 22:02
doc/whats_new/v1.4.rst (outdated; resolved)
Member
@glemaitre glemaitre left a comment


I checked the tree on Ames and the results remain the same, so I don't think we introduced an additional bug. I think the currently failing test is actually not good enough: it fails because we have a model with an R² around 0. I'll check if we can get something better.

doc/whats_new/v1.4.rst (outdated; resolved)
Comment on lines 1152 to 1156

# The missing samples are assumed to be in
# self.sample_indices[-self.n_missing:] that is
# self.sample_indices[end_non_missing:self.end].
cdef intp_t end_non_missing = self.end - self.n_missing
Member


Let's move this part inside the if statement since this is only useful for missing values.

Member


You can also remove the comment

Member Author


We cannot declare a cdef variable inside an if statement in Cython. Removed the comment, though.

sklearn/tree/_criterion.pyx (resolved)
sklearn/tree/_splitter.pyx (outdated; resolved)
sklearn/tree/tests/test_tree.py (outdated; resolved)
sklearn/tree/tests/test_tree.py (outdated; resolved)
@glemaitre glemaitre added this to the 1.4.1 milestone Jan 31, 2024
@glemaitre
Member

Could you use the following test in place of the one that is currently failing:

@pytest.mark.parametrize(
    "make_data, Tree, tolerance",
    [
        (datasets.make_regression, DecisionTreeRegressor, 0.08),
        (datasets.make_classification, DecisionTreeClassifier, 0.03),
    ],
)
@pytest.mark.parametrize("sample_weight_train", [None, "ones"])
def test_missing_values_is_resilience(
    make_data, Tree, sample_weight_train, global_random_seed, tolerance
):
    """Check that trees can deal with missing values and have decent performance."""
    n_samples, n_features = 5_000, 10
    X, y = make_data(
        n_samples=n_samples, n_features=n_features, random_state=global_random_seed
    )

    X_missing = X.copy()
    rng = np.random.RandomState(global_random_seed)
    X_missing[rng.choice([False, True], size=X.shape, p=[0.9, 0.1])] = np.nan
    X_missing_train, X_missing_test, y_train, y_test = train_test_split(
        X_missing, y, random_state=global_random_seed
    )
    if sample_weight_train == "ones":
        sample_weight = np.ones(X_missing_train.shape[0])
    else:
        sample_weight = None

    native_tree = Tree(max_depth=10, random_state=global_random_seed)
    native_tree.fit(X_missing_train, y_train, sample_weight=sample_weight)
    score_native_tree = native_tree.score(X_missing_test, y_test)

    tree_with_imputer = make_pipeline(
        SimpleImputer(), Tree(max_depth=10, random_state=global_random_seed)
    )
    tree_with_imputer.fit(X_missing_train, y_train)
    score_tree_with_imputer = tree_with_imputer.score(X_missing_test, y_test)

    assert abs(score_tree_with_imputer - score_native_tree) < tolerance

Instead of comparing missing vs. no-missing performance, I compare imputation vs. native support. I checked with the global random seeds that the test passes within the given tolerance.

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Member Author
@adam2392 adam2392 left a comment


Thanks for the review @glemaitre!

doc/whats_new/v1.4.rst (outdated; resolved)
Comment on lines 1152 to 1156

# The missing samples are assumed to be in
# self.sample_indices[-self.n_missing:] that is
# self.sample_indices[end_non_missing:self.end].
cdef intp_t end_non_missing = self.end - self.n_missing
Member Author


We cannot define a cdef inside an if statement in Cython. Removed the comment tho.

sklearn/tree/_splitter.pyx (outdated; resolved)
sklearn/tree/tests/test_tree.py (outdated; resolved)
@adam2392
Member Author
adam2392 commented Feb 1, 2024

Instead of comparing missing vs. no-missing performance, I compare imputation vs. native support. I checked with the global random seeds that the test passes within the given tolerance.

Thanks! This is a much better test. I think it will also help alleviate the issue when adding ExtraTree* to this unit-test.

Member
@glemaitre glemaitre left a comment


LGTM. Thanks for catching this one ;)

@lesteve
Member
lesteve commented Feb 1, 2024

Thanks! This is a much better test. I think it will also help alleviate the issue when adding ExtraTree* to this unit-test.

Running the tests locally with all the random seeds fails for quite a number of them (roughly 50% of the seeds?):

SKLEARN_TESTS_GLOBAL_RANDOM_SEED='all' pytest sklearn/tree/tests/test_tree.py -k is_resilience 

@glemaitre is on it, trying to create a better dataset for this test.

@adam2392
Member Author
adam2392 commented Feb 1, 2024

I think if that is the case, we probably need to lower the missing-completely-at-random NaN rate.

@glemaitre
Member

I think if that is the case, we probably need to lower the missing-completely-at-random NaN rate.

Actually, I think the dataset created for regression is not really good (we cannot predict nicely on it). Maybe we should craft an easier dataset and split the tests into two (or create a private function to generate the dataset). I'll look at the tree's discriminative performance to reduce the variance.

@glemaitre glemaitre self-requested a review February 2, 2024 16:48
@glemaitre
Member

I pushed a new version of the test using make_friedman1, which is pretty interesting: intuitively, mean imputation should be worse than the native support because of the sine relationship between X and y.

@adam2392
Member Author
adam2392 commented Feb 2, 2024

I pushed a new version of the test using make_friedman1, which is pretty interesting: intuitively, mean imputation should be worse than the native support because of the sine relationship between X and y.

Ah nice! Yeah, that is pretty sensible. I also see a more stable test in #27966. Testing now with all global random seeds.

@lesteve
Member
lesteve commented Feb 8, 2024

I can't get the test_missing_values_is_resilience test to fail on main, what am I missing?

SKLEARN_TESTS_GLOBAL_RANDOM_SEED='all' pytest sklearn/tree/tests/test_tree.py -k is_resilience -v

Since this is supposed to be a non-regression test, I would expect the test to fail on main ...

@adam2392
Member Author
adam2392 commented Feb 8, 2024

Are you referring to the existing test on main or the new one?

@lesteve
Member
lesteve commented Feb 8, 2024

Are you referring to the existing test on main or the new one?

test_missing_values_is_resilience. With @glemaitre's modifications, I thought this was a non-regression test that was supposed to fail on main, but maybe I misunderstood.

@adam2392
Member Author
adam2392 commented Feb 8, 2024

test_missing_values_is_resilience. With @glemaitre's modifications, I thought this was a non-regression test that was supposed to fail on main, but maybe I misunderstood.

Ah, I see. The test_regression_tree_missing_values_toy changes are the ones that will fail on main, because the impurity is still not computed correctly there.

However, when we changed the Cython code to make it correct, test_missing_values_is_resilience was failing. @glemaitre suggested that this test was not actually good at testing "native missing-value support vs. situation_X", where "situation_X" is some non-missing-value situation:

  1. "situation_X" was originally no missing values
  2. "situation_X" is now naively imputing the missing values

In addition, the dataset used originally was not very good (see: #28327 (comment)).

Hope that clarifies!

@lesteve
Member
lesteve commented Feb 9, 2024

Thanks for the explanation that clarifies things a lot!

I pushed a small simplification to build a classification dataset based on friedman1 (there may be an easier way). That allows removing the tolerance parameter, which I did not like much and found a bit hard to follow.

I made sure locally that the tests pass for all the seeds, but if you can double-check, that would be great:

SKLEARN_TESTS_GLOBAL_RANDOM_SEED='all' pytest sklearn/tree/tests/test_tree.py -k is_resilience -s

Feel free to improve it if you want!

Another thing I am wondering about: rather than only checking that the MSE is positive, would it make sense to add a simple test like the one in #28316, where we theoretically know what the impurity should be and can make sure that this is the case?

@adam2392
Member Author
adam2392 commented Feb 9, 2024

I pushed a small simplification to build a classification dataset based on friedman1 (there may be an easier way). That allows removing the tolerance parameter, which I did not like much and found a bit hard to follow.

I made sure locally that the tests pass for all the seeds, but if you can double-check, that would be great:

Thanks! Will take a look this weekend.

Another thing I am wondering about: rather than only checking that the MSE is positive, would it make sense to add a simple test like the one in #28316, where we theoretically know what the impurity should be and can make sure that this is the case?

I think it's possibly a good unit test to add in addition to what we have. The existing test test_regression_tree_missing_values_toy implicitly ensures we do not compute negative MSE values from improperly handling missing values in the left/right children. That is, imo, a separate issue from testing that the MSE is computed correctly. As an example: before this PR, the MSE was computed correctly numerically, but only when missing values went to the left, so such a test would be insufficient.

Also, I'm planning on extending this test in #27966, and for extra trees I think it would be difficult to compute the true MSE theoretically since the splits are random. WDYT is the best approach here?

@lesteve
Member
lesteve commented Feb 10, 2024

In my last commit, I pushed a tweak to check that the impurities match after the first split against a tree fitted without NaN, so at least that's a cheap way to sanity-check the impurity calculation. Not sure if there is a way to generalise the idea to deeper tree levels.

@@ -2658,7 +2658,10 @@ def test_regression_tree_missing_values_toy(X, criterion):
y = np.arange(6)

tree = DecisionTreeRegressor(criterion=criterion, random_state=0).fit(X, y)
assert all(tree.tree_.impurity >= 0) # MSE should always be positive
tree_ref = clone(tree).fit(y.reshape(-1, 1), y)
# assert all(tree.tree_.impurity >= 0) # MSE should always be positive
Member


@lesteve are you sure this was not failing on main? Indeed, this is the example that @adam2392 and I used to make sure the impurity of the tree makes sense. The error in the statistics implied a negative MSE, which is impossible.

Member


Sorry, I did not mean to comment on this test ...

@lesteve
Member
lesteve commented Feb 12, 2024

I have to admit I did not find the time to understand the tree code well enough to be 100% confident that the code change is correct.

I am fine with merging this as is, since the additional test where the NaNs go to the right node makes complete sense.

I would also be completely fine to revert the additional sanity-check I have added in 55e96af.

@glemaitre
Member

@lesteve I'm fine with the changes that you have done.

@lesteve
Member
lesteve commented Feb 12, 2024

OK let's merge this then, thanks @adam2392 for the fix!

@lesteve lesteve merged commit ba59a17 into scikit-learn:main Feb 12, 2024
30 checks passed
LeoGrin pushed a commit to LeoGrin/scikit-learn that referenced this pull request Feb 12, 2024
…mpurity` (scikit-learn#28327)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Loïc Estève <loic.esteve@ymail.com>
glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Feb 13, 2024
…mpurity` (scikit-learn#28327)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Loïc Estève <loic.esteve@ymail.com>
glemaitre added a commit that referenced this pull request Feb 13, 2024
…mpurity` (#28327)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Loïc Estève <loic.esteve@ymail.com>
@adam2392 adam2392 deleted the regtree branch February 15, 2024 17:08
Successfully merging this pull request may close these issues.

TreeRegressors with MSE Criterion do not correctly handle missing-values
3 participants