
ENH detect categorical polars columns in HistGradientBoosting #27835

Merged

Conversation

@jeromedockes (Contributor) commented Nov 23, 2023

Reference Issues/PRs

This extends #26411 so that the option categorical_features="from_dtype" in HistGradientBoosting also works when the input is a polars dataframe.

What does this implement/fix? Explain your changes.

This is a POC, following an (oral) discussion of the feasibility of generalizing #26411 via the dataframe interchange protocol.

Any other comments?

Meta-issue for polars / pyarrow support: #25896.

@jeromedockes jeromedockes marked this pull request as draft November 23, 2023 14:28
github-actions bot commented Nov 23, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit f244d29.

categorical_columns_mask = np.asarray(
    [
        c.dtype[0].name == "CATEGORICAL"
        and c.describe_categorical["is_dictionary"]
jeromedockes (Contributor, author):

At the moment, in polars, pandas and pyarrow, from_dataframe raises a NotImplementedError for categorical columns for which is_dictionary is False; columns produced by pandas and polars always have is_dictionary set to True.

Member:

What is the meaning of is_dictionary? Maybe you could add an educational inline comment to explain why we expect this to be True for our use case.

Member:

columns produced by pandas and polars always have is_dictionary True

What about pyarrow?

jeromedockes (Contributor, author):

What about pyarrow?

AFAICT for pyarrow as well, "is_dictionary" is always True, both in the columns it produces and in those it accepts.

jeromedockes (Contributor, author):

The protocol's description of is_dictionary is:

whether a mapping of categorical values to other objects exists

When is_dictionary is True, the categorical column's describe_categorical attribute, which is a dict, has a value for "categories": a Column providing a mapping from int (the indices) to categories:

>>> import pandas as pd
>>> df = pd.DataFrame({"A": pd.Series(["b", "a"], dtype="category")})
>>> df["A"].cat.categories
Index(['a', 'b'], dtype='object')
>>> int_col = df.__dataframe__().get_column(0)
>>> int_col.describe_categorical['is_dictionary']
True
>>> int_col.describe_categorical['categories']._col
0    a
1    b
dtype: object

So in that case (which seems to always be the case for pandas, polars and pyarrow) we can read the categories from describe_categorical["categories"].

I suppose when is_dictionary is False the category values are stored directly in the column (and describe_categorical["categories"] is None), but I don't have an example.
In that case we could get the categories by loading the column, filtering nulls and getting unique values.
That is what is done for non-dataframe inputs, but the code that does it assumes the values are numbers so we may need to adapt it.

That is my impression, but I may be misunderstanding the interchange protocol 😅 as I have no experience with it.
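The fallback described above (load the column, filter nulls, take unique values), adapted for string-valued categories, might look roughly like this (a sketch with made-up data, not the PR's code):

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with a missing value.
col = pd.Series(["b", None, "a", "b"], dtype="category")

values = np.asarray(col)             # object array; missing value becomes NaN
observed = values[~pd.isna(values)]  # filter out nulls
categories = np.unique(observed)     # sorted unique category values
```

Unlike the existing numeric path, this uses pd.isna rather than np.isnan, which raises a TypeError on object arrays of strings.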

@ogrisel (Member) commented Nov 24, 2023:

so in that case (which seems to be always for pandas, polars and pyarrow) we can read the categories from describe_categorical["categories"].

I agree with your analysis, but unfortunately I did not find an easy way to reconstruct the list of unique categories as a numpy array (e.g. using numpy.from_dlpack), or even as a Python list, from the public dataframe interchange API only.

jeromedockes (Contributor, author):

I couldn't either (without using pyarrow). Then I guess we should give up on using the interchange protocol to get the categories, and use it only to identify categorical columns. We can then get the categories by loading the column from the original dataframe into a numpy array, as is done in main. In that case we don't care about the value of "is_dictionary".
(As I mentioned, I have the impression the easiest way to do this is to let the OrdinalEncoder find the categories.)

Member:

I agree. Then we can later improve the implementation of category introspection in OrdinalEncoder in a separate, independent PR.

X.__dataframe__()
    .get_column(f_idx)
    .describe_categorical["categories"]
    ._col
jeromedockes (Contributor, author):

Relying on the fact that both pandas and polars happen to store the original Series in ._col is definitely not great. It is the strategy used in pandas.from_dataframe for categorical columns. pyarrow has a buffers_to_array function that can actually load the buffers from a dataframe API interchange Column.

Another option would be to not rely on the dataframe interchange protocol to find the categories, but only to discover the categorical columns. Those can then be loaded from X with np.asarray, after which we can remove missing values and duplicates in a way similar to what is done below when X is not a dataframe.
The code would differ from what is done for non-dataframe X, because that code only works for numbers (e.g. it relies on np.isnan) but polars categories are always strings.

Member:

Indeed, this is quite suboptimal. Decoding the categories array from the raw buffers seems non-trivial with numpy.

Maybe @jorisvandenbossche or @MarcoGorelli have a better suggestion?

jeromedockes (Contributor, author):

Or maybe we could use the OrdinalEncoder with categories="auto", and check that the number of categories is smaller than max_bins after the preprocessing, by inspecting its categories_ fitted attribute?

Otherwise, I have the impression we would implement the logic for finding categories and handling missing values twice: once here and once in the OrdinalEncoder.
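A minimal sketch of that idea (hypothetical data; 255 hardcoded as the HistGradientBoosting default for max_bins):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical input with one categorical column.
X = pd.DataFrame({"f1": pd.Series(["a", "b", "a"], dtype="category")})
max_bins = 255

# Let the encoder discover the observed categories, then validate their
# count against max_bins instead of duplicating the discovery logic.
enc = OrdinalEncoder(categories="auto").fit(X)
n_categories = len(enc.categories_[0])
assert n_categories <= max_bins
```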

jeromedockes (Contributor, author):

In this case we would just need the interchange protocol to discover which columns are categorical. And rather than implementing it in the gradient boosting, maybe compose.make_column_selector could be generalized a bit to allow selecting by interchange DTypeKind, so that the preprocessing done in the gradient boosting would rely more on ColumnTransformer, OrdinalEncoder and make_column_selector.
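A rough sketch of what such a selector could look like (select_interchange_categorical is a hypothetical helper, not a scikit-learn API):

```python
import pandas as pd

def select_interchange_categorical(df):
    """Hypothetical selector returning names of CATEGORICAL columns,
    in the spirit of a DTypeKind-aware make_column_selector."""
    idf = df.__dataframe__()
    return [
        name
        for name in idf.column_names()
        if idf.get_column_by_name(name).dtype[0].name == "CATEGORICAL"
    ]

# Toy dataframe mixing a categorical and a numeric column.
df = pd.DataFrame({
    "color": pd.Series(["red", "blue"], dtype="category"),
    "size": [1.0, 2.0],
})
```

Such a callable could then be passed to a ColumnTransformer the same way the output of make_column_selector is today.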

Member:

That sounds like a good plan to me. @thomasjpfan any opinion?

Member:

Improving make_column_selector to select categorical columns makes sense. I would open another PR to see what it looks like in code.

maybe we could use the OrdinalEncoder with categories="auto",

This would most likely work. Originally, I used pandas's unique function to find the unique categories, because it was faster than having OrdinalEncoder discover them automatically. In the end, I wanted to update OrdinalEncoder to have a fast path for pandas series.

Note, the issue with grabbing the categories from the dtype is that they are a superset of the actual categories. For example, the input data could consist of 2 unique categories, but the categorical dtype can consist of 300 categories. OrdinalEncoder's current behavior is to only discover the categories that are actually present in the data.

from pandas.api.types import CategoricalDtype
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X = pd.DataFrame(
    {
        "f1": pd.Series(["c1", "c2"], dtype=CategoricalDtype(categories=[f"c{i}" for i in range(300)]))
    }
)

enc = OrdinalEncoder().fit(X)
print(enc.categories_)
# [array(['c1', 'c2'], dtype=object)]

Also, if there are only 2 actual categories but the dtype has 300 categories, then I expect histogram gradient boosting should still work because 2 actual categories < max_bins.

jeromedockes (Contributor, author):

That's a good point, and another reason for relying not on the dtype but on the actual column values.

Also, if there are only 2 actual categories but the dtype has 300 categories, then I expect histogram gradient boosting should still work because 2 actual categories < max_bins.

Indeed, that makes more sense. A similar issue I noticed is that in the main branch, if a categorical column contains values greater than max_bins (before they are re-encoded by the OrdinalEncoder), this check raises a ValueError. However, as they get re-encoded later, I think what matters is the number of actual categories (the number of unique values) rather than the maximum value in the column.

import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

X = np.asarray([0, 1, 1000])[:, None]
y = [0., 0., 0.]
gb = HistGradientBoostingRegressor(categorical_features=[0])
gb.fit(X, y)
# ValueError: Categorical feature at index 0 is expected to be encoded with values < 255 but the largest value for the encoded categories is 1000.

If I remove the check, fit and predict work as expected.

Member:

that's a good point and another reason for not relying on the dtype but on the actual column values.

At the same time, I find it a bit annoying that the dimensions of model parameters, or the shape of the output of transform or predict_proba, depend on the observed values in X or y, especially when trying to consolidate models fit on resampled datasets with rare categories in the features or rare classes in the target. I recently faced the y-variant of this issue when experimenting with a new kind of meta-estimator for quantile regression:

https://nbviewer.org/github/ogrisel/notebooks/blob/master/quantile_regression_as_classification.ipynb

In the predict_quantiles method, I have to post-process the output of predict_proba to deal with unobserved classes.

When the scikit-learn API was designed, we only had numpy arrays in mind. The expected number of classes (and later categories in features) was never present in numpy dtypes, hence we naturally had to infer it implicitly from the training set values.

But now that data science is moving more towards dataframes with richer dtype metadata, maybe this design decision is counterproductive.

For classes / predict_proba, I think we don't want to change the current behavior, at least not in the short term.

But when adding dataframe support for categorical feature aware transformers / estimators, maybe we should have an option to collapse categories to the observed ones or preserve all categories specified in the dtype, even when unused.

jeromedockes (Contributor, author):

Indeed, it can be annoying to need this kind of post-processing. Although, for the specific case of categorical features (not outputs) of the gradient boosting, I don't see which public attributes or method outputs would change by taking into account categories that do not show up in the training data.

@ogrisel (Member) commented Dec 4, 2023:

Indeed, my remark was rather about the categorical transformers, and maybe later about categorical-encoded y for classifiers (which we do not handle in any special way as of now).

For categorical features in HGBDT, the PR is good the way it is (delegating to OrdinalEncoder and only using the categories observed in the training set).

@ogrisel (Member) left a comment:

I think I like this PR the way it is now.

I think the optimization categories = X.iloc[:, f_idx].unique().dropna().to_numpy() for pandas dataframes should be re-added to OrdinalEncoder in a follow-up PR if categorical dtype inspection is either too challenging to implement correctly with the DF interchange API or too disruptive in terms of change of behavior (as discussed in the threads above).
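For illustration, that pandas-specific fast path applied to a toy dataframe (hypothetical data):

```python
import pandas as pd

# Hypothetical dataframe with one categorical column containing a null.
X = pd.DataFrame({"f": pd.Series(["b", "a", None, "a"], dtype="category")})
f_idx = 0

# unique() keeps the order of first appearance, dropna() removes the
# missing value, and to_numpy() yields the observed categories.
categories = X.iloc[:, f_idx].unique().dropna().to_numpy()
```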

Before considering a merge of this PR, I would just like to add pyarrow support in the tests as detailed in the comments below:

@ogrisel (Member) commented Dec 1, 2023:

Also @jeromedockes, please remove the WIP/Draft markers of this PR unless you still have TODOs in mind; in which case, please state them explicitly in a bullet list in the description of the PR.

@jeromedockes jeromedockes changed the title [WIP] detect categorical polars columns in HistGradientBoosting detect categorical polars columns in HistGradientBoosting Dec 1, 2023
@jeromedockes jeromedockes marked this pull request as ready for review December 1, 2023 15:56
@glemaitre glemaitre self-requested a review December 4, 2023 14:50
@ogrisel (Member) left a comment:

Thanks for the follow-up. Just a final suggestion for the test utility.

Review comment on sklearn/utils/_testing.py (outdated, resolved).
@glemaitre glemaitre changed the title detect categorical polars columns in HistGradientBoosting ENH detect categorical polars columns in HistGradientBoosting Dec 4, 2023
@glemaitre (Member) left a comment:

It looks good. Just a couple of small comments.

Review comment on doc/whats_new/v1.4.rst (outdated, resolved).
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@glemaitre (Member) left a comment:

LGTM otherwise.

@glemaitre glemaitre merged commit e1ce967 into scikit-learn:main Dec 5, 2023
27 checks passed
@glemaitre (Member):
Thanks @jeromedockes
