8000 Update user guide and linked example to match · scikit-learn/scikit-learn@75b1afe · GitHub

Commit 75b1afe

Update user guide and linked example to match
1 parent 5fa30ec commit 75b1afe

2 files changed (+44, -16 lines)

doc/modules/ensemble.rst

Lines changed: 19 additions & 8 deletions
@@ -1077,10 +1077,13 @@ categorical features as continuous (ordinal), which happens for ordinal-encoded
 categorical data, since categories are nominal quantities where order does not
 matter.
 
-To enable categorical support, a boolean mask can be passed to the
-`categorical_features` parameter, indicating which feature is categorical. In
-the following, the first feature will be treated as categorical and the
-second feature as numerical::
+There are several ways to use the native categorical feature support for those
+estimators. The simplest way is to pass the training data as a `pandas.DataFrame`
+where the categorical features are of type `category`.
+
+Alternatively, it is possible to pass a boolean mask to the `categorical_features`
+parameter, indicating which feature is categorical. In the following, the first
+feature will be treated as categorical and the second feature as numerical::
 
     >>> gbdt = HistGradientBoostingClassifier(categorical_features=[True, False])
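
The two options introduced above (pandas `category` dtype and a boolean mask) can be tried with a small, self-contained snippet. This is a minimal sketch, not part of the commit; the toy dataframe, the column names `f0`/`f1` and the use of `categorical_features="from_dtype"` (only available in recent scikit-learn releases) are assumptions for illustration:

import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier

# Toy data for illustration only: "f0" is categorical (already encoded as
# small integers), "f1" is numerical.
X = pd.DataFrame({"f0": [0, 1, 0, 2, 1, 0], "f1": [1.0, 2.5, 3.0, 0.5, 1.5, 2.0]})
y = [0, 1, 0, 1, 1, 0]

# Option 1: mark the column with the pandas `category` dtype; assumes a
# scikit-learn version that can infer categories from the dtype
# (e.g. via `categorical_features="from_dtype"`).
gbdt_dtype = HistGradientBoostingClassifier(categorical_features="from_dtype")
gbdt_dtype.fit(X.astype({"f0": "category"}), y)

# Option 2: boolean mask, first feature categorical, second numerical.
gbdt_mask = HistGradientBoostingClassifier(categorical_features=[True, False])
gbdt_mask.fit(X, y)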

@@ -1089,10 +1092,18 @@ categorical features::
 
     >>> gbdt = HistGradientBoostingClassifier(categorical_features=[0])
 
-The cardinality of each categorical feature should be less than the `max_bins`
-parameter, and each categorical feature is expected to be encoded in
-`[0, max_bins - 1]`. To that end, it might be useful to pre-process the data
-with an :class:`~sklearn.preprocessing.OrdinalEncoder` as done in
+Finally, one can pass a list of strings indicating the names of the categorical
+features if the training data is passed as a dataframe with string column names::
+
+    >>> gbdt = HistGradientBoostingClassifier(categorical_features=['f0'])
+
+In any case, the cardinality of each categorical feature should be less than
+the `max_bins` parameter, and each categorical feature is expected to be
+encoded in `[0, max_bins - 1]`.
+
+If the original data is not already using a numerical encoding for the
+categorical features, it can be pre-processed with an
+:class:`~sklearn.preprocessing.OrdinalEncoder` as done in
 :ref:`sphx_glr_auto_examples_ensemble_plot_gradient_boosting_categorical.py`.
 
 If there are missing values during training, the missing values will be
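
The remaining additions (specifying categorical features by column name, and pre-processing with an :class:`~sklearn.preprocessing.OrdinalEncoder` when the data is not yet numerically encoded) can be combined in a small pipeline. This is a minimal sketch, not part of the commit; the toy string-valued dataframe and the column name `f0` are assumptions for illustration:

import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# String-valued categorical column, i.e. not yet numerically encoded.
X = pd.DataFrame(
    {"f0": ["a", "b", "a", "c", "b", "a"], "f1": [1.0, 2.5, 3.0, 0.5, 1.5, 2.0]}
)
y = [0, 1, 0, 1, 1, 0]

# Encode "f0" into [0, n_categories - 1] and keep the column name unchanged so
# that it can still be referenced by name in `categorical_features`.
encoder = make_column_transformer(
    (OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan), ["f0"]),
    remainder="passthrough",
    verbose_feature_names_out=False,
).set_output(transform="pandas")

model = make_pipeline(
    encoder, HistGradientBoostingClassifier(categorical_features=["f0"])
)
model.fit(X, y)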

examples/ensemble/plot_gradient_boosting_categorical.py

Lines changed: 25 additions & 8 deletions
@@ -126,10 +126,6 @@
         make_column_selector(dtype_include="category"),
     ),
     remainder="passthrough",
-    # Use short feature names to make it easier to specify the categorical
-    # variables in the HistGradientBoostingRegressor in the next step
-    # of the pipeline.
-    verbose_feature_names_out=False,
 )
 
 hist_ordinal = make_pipeline(
@@ -146,12 +142,33 @@
 # To benefit from this, one option is to encode the categorical features using the
 # pandas categorical dtype which we already did at the beginning of this
 # example with the call to `.astype("category")`.
-#
-# Note that this is equivalent to using the ordinal encoder and then passing
-# the name of the categorical features to the ``categorical_features``
-# constructor parameter of :class:`~ensemble.HistGradientBoostingRegressor`.
 hist_native = HistGradientBoostingRegressor(random_state=42)
 
+# %%
+# Note that this is equivalent to using an ordinal encoder that outputs a pandas
+# dataframe with unchanged column names and then passing the names of the
+# categorical features to the ``categorical_features`` constructor parameter of
+# :class:`~ensemble.HistGradientBoostingRegressor`:
+
+ordinal_encoder = make_column_transformer(
+    (
+        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan),
+        categorical_columns,
+    ),
+    remainder="passthrough",
+    # Use short feature names to make it easier to specify the categorical
+    # variables in the HistGradientBoostingRegressor in the next step
+    # of the pipeline.
+    verbose_feature_names_out=False,
+).set_output(transform="pandas")
+
+hist_native2 = make_pipeline(
+    ordinal_encoder,
+    HistGradientBoostingRegressor(
+        categorical_features=categorical_columns, random_state=42
+    ),
+)
+
 # %%
 # Model comparison
 # ----------------
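
A natural follow-up to the new `hist_native2` pipeline is to check the claimed equivalence empirically. The snippet below is a sketch, not part of the committed example; it assumes the objects defined earlier in the full example (`X`, `y`, `hist_native`, `hist_native2`), and the use of `cross_validate` with its default scorer is also an assumption:

from sklearn.model_selection import cross_validate

# Both estimators handle the categorical columns in equivalent ways, so their
# cross-validated scores are expected to be very close.
native_cv = cross_validate(hist_native, X, y, cv=5)
native2_cv = cross_validate(hist_native2, X, y, cv=5)
print(native_cv["test_score"].mean(), native2_cv["test_score"].mean())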
