FIX handle properly missing value in MSE and Friedman-MSE children_impurity #28327

Merged

merged 20 commits on Feb 12, 2024

Conversation

adam2392
Member

Reference Issues/PRs

Closes: #28316

What does this implement/fix? Explain your changes.

  • Missing values wrongly contribute to sq_sum_left/sq_sum_right in MSE-type criteria. This PR makes those running sums account for missing values correctly.
  • Augments the existing unit test to cover Friedman-MSE and MSE (note: the added unit test fails on main).
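As a hedged illustration of the bug class in plain NumPy (not the actual Cython code in `_criterion.pyx`): MSE-style criteria derive the right child's squared sum by subtraction, so if the missing samples' squared targets leak into the left accumulator, the derived right-child impurity can even come out negative, which is impossible for a true MSE.

```python
import numpy as np

# Toy node: two samples go left, two go right, and two samples with a
# missing feature value are routed to the right child.
y_left = np.array([0.0, 1.0])
y_right = np.array([2.0, 3.0])
y_missing = np.array([4.0, 5.0])  # missing-value samples, sent right

sq_sum_total = np.sum(np.concatenate([y_left, y_right, y_missing]) ** 2)

# Buggy accumulation: missing samples leak into the left squared sum,
# so the subtracted right squared sum is far too small.
sq_sum_left_buggy = np.sum(y_left**2) + np.sum(y_missing**2)
sq_sum_right_buggy = sq_sum_total - sq_sum_left_buggy

n_right = len(y_right) + len(y_missing)
mean_right = np.mean(np.concatenate([y_right, y_missing]))
impurity_right_buggy = sq_sum_right_buggy / n_right - mean_right**2

# Correct accumulation: each sample is counted once, on the side it is sent to.
sq_sum_left = np.sum(y_left**2)
sq_sum_right = sq_sum_total - sq_sum_left
impurity_right = sq_sum_right / n_right - mean_right**2

print(impurity_right_buggy)  # -9.0: an impossible (negative) MSE
print(impurity_right)        # 1.25: equals np.var([2, 3, 4, 5])
```

The numbers are illustrative only; the actual fix lives in the Cython criterion code.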

Any other comments?

github-actions bot commented Jan 31, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: a23e8aa.

@glemaitre glemaitre changed the title [BUG FIX] Enable MSE and Friedman-MSE compute children_impurity correctly FIX handle properly missing value in MSE and Friedman-MSE children_impurity Jan 31, 2024
doc/whats_new/v1.5.rst (outdated; resolved)
@glemaitre glemaitre self-requested a review January 31, 2024 22:02
doc/whats_new/v1.4.rst (outdated; resolved)
Member
@glemaitre glemaitre left a comment


I checked the tree on Ames and the results remain the same, so I don't think we introduced an additional bug. I think the currently failing test is actually not good enough: it fails because we have a model with an R² around 0. I'll check if we can get something better.

doc/whats_new/v1.4.rst (outdated; resolved)
Comment on lines 1152 to 1156

# The missing samples are assumed to be in
# self.sample_indices[-self.n_missing:] that is
# self.sample_indices[end_non_missing:self.end].
cdef intp_t end_non_missing = self.end - self.n_missing
Member


Let's move this part inside the if statement since this is only useful for missing values.

Member


You can also remove the comment

Member Author


We cannot declare a cdef variable inside an if statement in Cython. Removed the comment, though.

sklearn/tree/_criterion.pyx (resolved)
sklearn/tree/_splitter.pyx (outdated; resolved)
sklearn/tree/tests/test_tree.py (outdated; resolved)
sklearn/tree/tests/test_tree.py (outdated; resolved)
@glemaitre glemaitre added this to the 1.4.1 milestone Jan 31, 2024
@glemaitre
Member

Could you use the following test in place of the one that is currently failing:

@pytest.mark.parametrize(
    "make_data, Tree, tolerance",
    [
        (datasets.make_regression, DecisionTreeRegressor, 0.08),
        (datasets.make_classification, DecisionTreeClassifier, 0.03),
    ],
)
@pytest.mark.parametrize("sample_weight_train", [None, "ones"])
def test_missing_values_is_resilience(
    make_data, Tree, sample_weight_train, global_random_seed, tolerance
):
    """Check that trees can deal with missing values and have decent performance."""
    n_samples, n_features = 5_000, 10
    X, y = make_data(
        n_samples=n_samples, n_features=n_features, random_state=global_random_seed
    )

    X_missing = X.copy()
    rng = np.random.RandomState(global_random_seed)
    X_missing[rng.choice([False, True], size=X.shape, p=[0.9, 0.1])] = np.nan
    X_missing_train, X_missing_test, y_train, y_test = train_test_split(
        X_missing, y, random_state=global_random_seed
    )
    if sample_weight_train == "ones":
        sample_weight = np.ones(X_missing_train.shape[0])
    else:
        sample_weight = None

    native_tree = Tree(max_depth=10, random_state=global_random_seed)
    native_tree.fit(X_missing_train, y_train, sample_weight=sample_weight)
    score_native_tree = native_tree.score(X_missing_test, y_test)

    tree_with_imputer = make_pipeline(
        SimpleImputer(), Tree(max_depth=10, random_state=global_random_seed)
    )
    tree_with_imputer.fit(X_missing_train, y_train)
    score_tree_with_imputer = tree_with_imputer.score(X_missing_test, y_test)

    assert abs(score_tree_with_imputer - score_native_tree) < tolerance

Instead of comparing missing vs. no-missing performance, I compare imputation vs. native support. I checked with the global random seeds that the test passes within the given tolerance.

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Member Author
@adam2392 adam2392 left a comment


Thanks for the review @glemaitre!

doc/whats_new/v1.4.rst (outdated; resolved)
Comment on lines 1152 to 1156

# The missing samples are assumed to be in
# self.sample_indices[-self.n_missing:] that is
# self.sample_indices[end_non_missing:self.end].
cdef intp_t end_non_missing = self.end - self.n_missing
Member Author


We cannot define a cdef inside an if statement in Cython. Removed the comment tho.

sklearn/tree/_splitter.pyx (outdated; resolved)
sklearn/tree/tests/test_tree.py (outdated; resolved)
@adam2392
Member Author
adam2392 commented Feb 1, 2024

Instead of comparing missing vs. no-missing performance, I compare imputation vs. native support. I checked with the global random seeds that the test passes within the given tolerance.

Thanks! This is a much better test. I think it will also help alleviate the issue when adding ExtraTree* to this unit-test.

Member
@glemaitre glemaitre left a comment


LGTM. Thanks for catching this one ;)

@lesteve
Member
lesteve commented Feb 1, 2024

Thanks! This is a much better test. I think it will also help alleviate the issue when adding ExtraTree* to this unit-test.

Running the tests locally with all the random seeds fails for quite a number of them (roughly 50% of the seeds?):

SKLEARN_TESTS_GLOBAL_RANDOM_SEED='all' pytest sklearn/tree/tests/test_tree.py -k is_resilience 

@glemaitre is on it, trying to create a better dataset for this test.

@adam2392
Member Author
adam2392 commented Feb 1, 2024

I think if that is the case, we probably need to lower the missing-completely-at-random NaN rate.

@glemaitre
Member

I think if that is the case, we probably need to lower the missing-completely-at-random NaN rate.

Actually, I think the dataset created for regression is not really good (we cannot predict nicely on it). Maybe we should craft an easier dataset and split the tests into two (or create a private function to generate the dataset). I'll look at the tree's discriminative performance to reduce the variance.

@glemaitre glemaitre self-requested a review February 2, 2024 16:48
@glemaitre
Member

I pushed a new version of the test using make_friedman1, which is pretty interesting: intuitively, mean imputation should be worse than the native support because of the sine relationship between X and y.

@adam2392
Member Author
adam2392 commented Feb 2, 2024

I pushed a new version of the test using make_friedman1, which is pretty interesting: intuitively, mean imputation should be worse than the native support because of the sine relationship between X and y.

Ah nice! Yeah, that is pretty sensible. I also see a more stable test in #27966. Testing now with all global random seeds.

@lesteve
Member
lesteve commented Feb 8, 2024

I can't get the test_missing_values_is_resilience test to fail on main, what am I missing?

SKLEARN_TESTS_GLOBAL_RANDOM_SEED='all' pytest sklearn/tree/tests/test_tree.py -k is_resilience -v

Since this is supposed to be a non-regression test, I would expect the test to fail on main ...

@adam2392
Member Author
adam2392 commented Feb 8, 2024

Are you referring to the existing test on main or the new one?

@lesteve
Member
lesteve commented Feb 8, 2024

Are you referring to the existing test on main or the new one?

test_missing_values_is_resilience. With @glemaitre's modifications, I thought this was a non-regression test that was supposed to fail on main, but maybe I misunderstood.

@adam2392
Member Author
adam2392 commented Feb 8, 2024

test_missing_values_is_resilience. With @glemaitre's modifications, I thought this was a non-regression test that was supposed to fail on main, but maybe I misunderstood.

Ah, I see. The test_regression_tree_missing_values_toy changes are the ones that will fail on main, because the impurity is still not computed correctly there.

However, when we changed the Cython code to make it correct, test_missing_values_is_resilience was failing. @glemaitre suggested that this test was not actually good at testing "native missing-value support vs. situation_X", where "situation_X" is some non-missing-value situation:

  1. "situation_X" was originally no missing values
  2. "situation_X" is now naively imputing the missing values

In addition, the dataset used originally was not very good (see: #28327 (comment)).

Hope that clarifies!

@lesteve
Member
lesteve commented Feb 9, 2024

Thanks for the explanation that clarifies things a lot!

I pushed a small simplification to build a classification dataset based on friedman1 (there may be an easier way). That allows removing the tolerance parameter, which I did not like much and found a bit hard to follow.

I made sure locally that the tests pass for all the seeds, but if you can double-check, that would be great:

SKLEARN_TESTS_GLOBAL_RANDOM_SEED='all' pytest sklearn/tree/tests/test_tree.py -k is_resilience -s

Feel free to improve it if you want!

Another thing I am wondering about: rather than only checking that the MSE is positive, would it make sense to add a simple test like the one in #28316, where we theoretically know what the impurity should be and can make sure that this is the case?

@adam2392
Member Author
adam2392 commented Feb 9, 2024

I pushed a small simplification to build a classification dataset based on friedman1 (there may be an easier way). That allows removing the tolerance parameter, which I did not like much and found a bit hard to follow.

I made sure locally that the tests pass for all the seeds, but if you can double-check, that would be great:

Thanks! Will take a look this weekend.

Another thing I am wondering about: rather than only checking that the MSE is positive, would it make sense to add a simple test like the one in #28316, where we theoretically know what the impurity should be and can make sure that this is the case?

I think it's possibly a good unit test to add in addition to what we have. The existing test test_regression_tree_missing_values_toy implicitly ensures we do not compute negative MSE values from improperly handling missing values in the left/right children. That is, imo, a separate issue from testing that the MSE is computed correctly. As an example: before this PR, the MSE was computed correctly numerically, but only when missing values went to the left, so such a test would be insufficient.

Also, I'm planning on extending this test in #27966, and for extra trees I think it would be difficult to compute the true MSE theoretically since the splits are random. WDYT is the best approach here?

@lesteve
Member
lesteve commented Feb 10, 2024

In my last commit, I pushed a tweak to check that the impurities match after the first split against a tree fitted without NaN, so at least that's a cheap way to sanity-check the impurity calculation. Not sure if there is a way to generalise the idea to deeper tree levels.

@@ -2658,7 +2658,10 @@ def test_regression_tree_missing_values_toy(X, criterion):
y = np.arange(6)

tree = DecisionTreeRegressor(criterion=criterion, random_state=0).fit(X, y)
assert all(tree.tree_.impurity >= 0) # MSE should always be positive
tree_ref = clone(tree).fit(y.reshape(-1, 1), y)
# assert all(tree.tree_.impurity >= 0) # MSE should always be positive
Member


@lesteve are you sure this was not failing on main? Indeed, this is the example that @adam2392 and I used to make sure the impurity of the tree makes sense. The error in the statistics implied a negative MSE, which is impossible.

Member


Sorry, I did not mean to comment on this test ...

@lesteve
Member
lesteve commented Feb 12, 2024

I have to admit I did not find the time to understand the tree code well enough to be 100% confident that the code change is correct.

I am fine with merging this as is, since the additional test where the NaNs go to the right node makes complete sense.

I would also be completely fine to revert the additional sanity-check I have added in 55e96af.

@glemaitre
Member

@lesteve I'm fine with the changes that you have done.

@lesteve
Member
lesteve commented Feb 12, 2024

OK let's merge this then, thanks @adam2392 for the fix!

@lesteve lesteve merged commit ba59a17 into scikit-learn:main Feb 12, 2024
30 checks passed
LeoGrin pushed a commit to LeoGrin/scikit-learn that referenced this pull request Feb 12, 2024
…mpurity` (scikit-learn#28327)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Loïc Estève <loic.esteve@ymail.com>
glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Feb 13, 2024
…mpurity` (scikit-learn#28327)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Loïc Estève <loic.esteve@ymail.com>
glemaitre added a commit that referenced this pull request Feb 13, 2024
…mpurity` (#28327)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Loïc Estève <loic.esteve@ymail.com>
@adam2392 adam2392 deleted the regtree branch February 15, 2024 17:08
Successfully merging this pull request may close these issues.

TreeRegressors with MSE Criterion do not correctly handle missing-values
3 participants