
ENH detect categorical polars columns in HistGradientBoosting #27835

Merged

Conversation

@jeromedockes (Contributor) commented Nov 23, 2023

Reference Issues/PRs

This extends #26411 so that the option categorical_features="from_dtype" in HistGradientBoosting also works when the input is a polars dataframe.

What does this implement/fix? Explain your changes.

This is a POC, following an (oral) discussion of the feasibility of generalizing #26411 via the dataframe interchange protocol.

Any other comments?

Meta-issue for polars / pyarrow support: #25896.

@jeromedockes jeromedockes marked this pull request as draft November 23, 2023 14:28
github-actions bot commented Nov 23, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit f244d29.

categorical_columns_mask = np.asarray(
    [
        c.dtype[0].name == "CATEGORICAL"
        and c.describe_categorical["is_dictionary"]
jeromedockes (Contributor, author):

At the moment, in polars, pandas and pyarrow, from_dataframe raises a NotImplementedError for categorical columns for which is_dictionary is False; columns produced by pandas and polars always have is_dictionary set to True.

Member:

What is the meaning of is_dictionary? Maybe you could add an educational inline comment to explain why we expect this to be True for our use case.

Member:

columns produced by pandas and polars always have is_dictionary True

What about pyarrow?

jeromedockes (Contributor, author):

What about pyarrow?

AFAICT for pyarrow as well, "is_dictionary" is always True, both in the columns it produces and in those it accepts.

jeromedockes (Contributor, author):

The protocol's description of is_dictionary is:

whether a mapping of categorical values to other objects exists

When is_dictionary is True, the categorical column's describe_categorical attribute, which is a dict, has a value for "categories": a Column providing a mapping from int (the indices) to categories:

>>> import pandas as pd
>>> df = pd.DataFrame({"A": pd.Series(["b", "a"], dtype="category")})
>>> df["A"].cat.categories
Index(['a', 'b'], dtype='object')
>>> int_col = df.__dataframe__().get_column(0)
>>> int_col.describe_categorical['is_dictionary']
True
>>> int_col.describe_categorical['categories']._col
0    a
1    b
dtype: object

So in that case (which seems to always be the case for pandas, polars and pyarrow) we can read the categories from describe_categorical["categories"].

I suppose when is_dictionary is False the category values are stored directly in the column (and describe_categorical["categories"] is None), but I don't have an example.
In that case we could get the categories by loading the column, filtering nulls and getting unique values.
That is what is done for non-dataframe inputs, but the code that does it assumes the values are numbers so we may need to adapt it.

That is my impression, but I may be misunderstanding the interchange protocol 😅 as I have no experience with it.
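The fallback described above (load the column, filter nulls, take unique values), adapted for string-valued categories, might look roughly like this (a sketch with made-up data, not the PR's code):

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with a missing value.
col = pd.Series(["b", None, "a", "b"], dtype="category")

values = np.asarray(col)             # object array; missing value becomes NaN
observed = values[~pd.isna(values)]  # filter out nulls
categories = np.unique(observed)     # sorted unique category values
```

Unlike the existing numeric path, this uses pd.isna rather than np.isnan, which raises a TypeError on object arrays of strings.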

@ogrisel (Member) commented Nov 24, 2023:

so in that case (which seems to be always for pandas, polars and pyarrow) we can read the categories from describe_categorical["categories"].

I agree with your analysis, but unfortunately I did not find an easy way to reconstruct the list of unique categories as a numpy array (e.g. using numpy.from_dlpack), or even as a Python list, from the public dataframe interchange API only.

jeromedockes (Contributor, author):

I couldn't either (without using pyarrow). Then I guess we should give up on using the interchange protocol to get the categories, and use it only to identify categorical columns. We can then get the categories by loading the column from the original dataframe into a numpy array, as is done in main. In that case we don't care about the value of "is_dictionary".
(As I mentioned, I have the impression the easiest way to do this is to let the OrdinalEncoder find the categories.)

Member:

I agree. Then we can later improve the implementation of category introspection in OrdinalEncoder in a separate, independent PR.

X.__dataframe__()
    .get_column(f_idx)
    .describe_categorical["categories"]
    ._col
jeromedockes (Contributor, author):

Relying on the fact that both pandas and polars happen to store the original Series in ._col is definitely not great. It is the strategy used in pandas.from_dataframe for categorical columns. pyarrow has a buffers_to_array function that can actually load the buffers from a dataframe API interchange Column.

Another option would be to not rely on the dataframe interchange protocol to find the categories, but only to discover the categorical columns. Those can then be loaded from X with np.asarray, after which we can remove missing values and duplicates in a way similar to what is done below when X is not a dataframe.
The code would differ from what is done for non-dataframe X, because that code only works for numbers (e.g. it relies on np.isnan) but polars categories are always strings.

Member:

Indeed, this is quite suboptimal. Decoding the categories array from the raw buffers seems non-trivial with numpy.

Maybe @jorisvandenbossche or @MarcoGorelli have a better suggestion?

jeromedockes (Contributor, author):

Or maybe we could use the OrdinalEncoder with categories="auto", and check that the number of categories is smaller than max_bins after the preprocessing, by inspecting its categories_ fitted attribute?

Otherwise, I have the impression we would implement the logic for finding categories and handling missing values twice: once here and once in the OrdinalEncoder.
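A minimal sketch of that idea (hypothetical data; 255 hardcoded as the HistGradientBoosting default for max_bins):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical input with one categorical column.
X = pd.DataFrame({"f1": pd.Series(["a", "b", "a"], dtype="category")})
max_bins = 255

# Let the encoder discover the observed categories, then validate their
# count against max_bins instead of duplicating the discovery logic.
enc = OrdinalEncoder(categories="auto").fit(X)
n_categories = len(enc.categories_[0])
assert n_categories <= max_bins
```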

jeromedockes (Contributor, author):

In this case we would just need the interchange protocol to discover which columns are categorical. And rather than implementing it in the gradient boosting, maybe compose.make_column_selector could be generalized a bit to allow selecting by interchange DTypeKind, so that the preprocessing done in the gradient boosting would rely more on ColumnTransformer, OrdinalEncoder and make_column_selector.
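A rough sketch of what such a selector could look like (select_interchange_categorical is a hypothetical helper, not a scikit-learn API):

```python
import pandas as pd

def select_interchange_categorical(df):
    """Hypothetical selector returning names of CATEGORICAL columns,
    in the spirit of a DTypeKind-aware make_column_selector."""
    idf = df.__dataframe__()
    return [
        name
        for name in idf.column_names()
        if idf.get_column_by_name(name).dtype[0].name == "CATEGORICAL"
    ]

# Toy dataframe mixing a categorical and a numeric column.
df = pd.DataFrame({
    "color": pd.Series(["red", "blue"], dtype="category"),
    "size": [1.0, 2.0],
})
```

Such a callable could then be passed to a ColumnTransformer the same way the output of make_column_selector is today.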

Member:

That sounds like a good plan to me. @thomasjpfan any opinion?

Member:

Improving make_column_selector to select categorical columns makes sense. I would open another PR to see what it looks like in code.

maybe we could use the OrdinalEncoder with categories="auto",

This would most likely work. Originally, I used pandas's unique function to find the unique categories, because it was faster than having OrdinalEncoder discover them automatically. In the end, I wanted to update OrdinalEncoder to have a fast path for pandas series.

Note, the issue with grabbing the categories from the dtype is that they are a superset of the actual categories. For example, the input data could consist of 2 unique categories, but the categorical dtype can consist of 300 categories. OrdinalEncoder's current behavior is to only discover the categories that are actually present in the data.

from pandas.api.types import CategoricalDtype
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X = pd.DataFrame(
    {
        "f1": pd.Series(["c1", "c2"], dtype=CategoricalDtype(categories=[f"c{i}" for i in range(300)]))
    }
)

enc = OrdinalEncoder().fit(X)
print(enc.categories_)
# [array(['c1', 'c2'], dtype=object)]

Also, if there are only 2 actual categories but the dtype has 300 categories, then I expect histogram gradient boosting should still work because 2 actual categories < max_bins.

jeromedockes (Contributor, author):

That's a good point, and another reason for relying not on the dtype but on the actual column values.

Also, if there are only 2 actual categories but the dtype has 300 categories, then I expect histogram gradient boosting should still work because 2 actual categories < max_bins.

Indeed, that makes more sense. A similar issue I noticed is that in the main branch, if a categorical column contains values greater than max_bins (before they are re-encoded by the OrdinalEncoder), this check raises a ValueError. However, as they get re-encoded later, I think what matters is the number of actual categories (the number of unique values) rather than the maximum value in the column.

import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

X = np.asarray([0, 1, 1000])[:, None]
y = [0., 0., 0.]
gb = HistGradientBoostingRegressor(categorical_features=[0])
gb.fit(X, y)
# ValueError: Categorical feature at index 0 is expected to be encoded with values < 255 but the largest value for the encoded categories is 1000.

If I remove the check, fit and predict work as expected.

Member:

that's a good point and another reason for not relying on the dtype but on the actual column values.

At the same time, I find it a bit annoying that the dimensions of model parameters, or the shape of the output of transform or predict_proba, depend on the observed values in X or y, especially when trying to consolidate models fit on resampled datasets with rare categories in the features or rare classes in the target. I recently faced the y-variant of this issue when experimenting with a new kind of meta-estimator for quantile regression:

https://nbviewer.org/github/ogrisel/notebooks/blob/master/quantile_regression_as_classification.ipynb

In the predict_quantiles method, I have to post-process the output of predict_proba to deal with unobserved classes.

When the scikit-learn API was designed, we only had numpy arrays in mind. The expected number of classes (and later categories in features) was never present in numpy dtypes, hence we naturally had to infer it implicitly from the training set values.

But now that data science is moving more towards dataframes with richer dtype metadata, maybe this design decision is counterproductive.

For classes / predict_proba, I think we don't want to change the current behavior, at least not in the short term.

But when adding dataframe support for categorical feature aware transformers / estimators, maybe we should have an option to collapse categories to the observed ones or preserve all categories specified in the dtype, even when unused.

jeromedockes (Contributor, author):

Indeed, it can be annoying to need this kind of post-processing. Although, for the specific case of categorical features (not outputs) of the gradient boosting, I don't see which public attributes or method outputs would change by taking into account categories that do not show up in the training data.

@ogrisel (Member) commented Dec 4, 2023:

Indeed, my remark was rather about the categorical transformers, and maybe later about categorical-encoded y for classifiers (which we do not handle in any special way as of now).

For categorical features in HGBDT, the PR is good the way it is (delegating to OrdinalEncoder and only using the categories observed in the training set).

@ogrisel (Member) left a comment:

I think I like this PR the way it is now.

I think the optimization categories = X.iloc[:, f_idx].unique().dropna().to_numpy() for pandas dataframes should be re-added to OrdinalEncoder in a follow-up PR if categorical dtype inspection is either too challenging to implement correctly with the DF interchange API or too disruptive in terms of change of behavior (as discussed in the threads above).
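For illustration, that pandas-specific fast path applied to a toy dataframe (hypothetical data):

```python
import pandas as pd

# Hypothetical dataframe with one categorical column containing a null.
X = pd.DataFrame({"f": pd.Series(["b", "a", None, "a"], dtype="category")})
f_idx = 0

# unique() keeps the order of first appearance, dropna() removes the
# missing value, and to_numpy() yields the observed categories.
categories = X.iloc[:, f_idx].unique().dropna().to_numpy()
```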

Before considering a merge of this PR, I would just like to add pyarrow support in the tests as detailed in the comments below:

@ogrisel (Member) commented Dec 1, 2023:

Also @jeromedockes, please remove the WIP/Draft markers of this PR unless you still have TODOs in mind; in which case, please state them explicitly in a bullet list in the description of the PR.

@jeromedockes jeromedockes changed the title [WIP] detect categorical polars columns in HistGradientBoosting detect categorical polars columns in HistGradientBoosting Dec 1, 2023
@jeromedockes jeromedockes marked this pull request as ready for review December 1, 2023 15:56
@glemaitre glemaitre self-requested a review December 4, 2023 14:50
@ogrisel (Member) left a comment:

Thanks for the follow-up. Just a final suggestion for the test utility.

Review comment on sklearn/utils/_testing.py (outdated, resolved).
@glemaitre glemaitre changed the title detect categorical polars columns in HistGradientBoosting ENH detect categorical polars columns in HistGradientBoosting Dec 4, 2023
@glemaitre (Member) left a comment:

It looks good. Just a couple of small comments.

Review comment on doc/whats_new/v1.4.rst (outdated, resolved).
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@glemaitre (Member) left a comment:

LGTM otherwise.

@glemaitre glemaitre merged commit e1ce967 into scikit-learn:main Dec 5, 2023
27 checks passed
@glemaitre (Member):
Thanks @jeromedockes
