scikit-learn/scikit-learn, commit 072b481

FIX Treat gradient boosting categoricals outside the bounds as unknown during predict (#24283)
1 parent 203a1b0 commit 072b481

4 files changed: +38 -1 lines changed

doc/whats_new/v1.2.rst

Lines changed: 5 additions & 0 deletions
@@ -262,6 +262,11 @@ Changelog
   NaN in feature importance when fitted with very small sample weight.
   :pr:`20415` by :user:`Zhehao Liu <MaxwellLZH>`.

+- |Fix| :class:`ensemble.HistGradientBoostingClassifier` and
+  :class:`ensemble.HistGradientBoostingRegressor` no longer error when predicting
+  on categories encoded as negative values and instead consider them a member
+  of the "missing category". :pr:`24283` by `Thomas Fan`_.
+
 - |API| Rename the constructor parameter `base_estimator` to `estimator` in
   the following classes:
   :class:`ensemble.BaggingClassifier`,

sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx

Lines changed: 4 additions & 1 deletion
@@ -60,7 +60,10 @@ cdef inline Y_DTYPE_C _predict_one_from_raw_data(
             else:
                 node_idx = node.right
         elif node.is_categorical:
-            if in_bitset_2d_memoryview(
+            if data_val < 0:
+                # data_val is not in the accepted range, so it is treated as missing value
+                node_idx = node.left if node.missing_go_to_left else node.right
+            elif in_bitset_2d_memoryview(
                     raw_left_cat_bitsets,
                     <X_BINNED_DTYPE_C>data_val,
                     node.bitset_idx):
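
The routing rule added in this hunk can be sketched in plain Python. This is only an illustration, not the Cython implementation: the `Node` dataclass, the `left_categories` set (standing in for the left-category bitset) and the `route_categorical` helper are hypothetical names, and the sketch assumes that a value found in the bitset is sent to the left child.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical stand-in for the predictor's node record.
    left: int
    right: int
    missing_go_to_left: bool
    left_categories: frozenset  # plays the role of the left-category bitset


def route_categorical(node: Node, data_val: float) -> int:
    """Return the index of the child that a categorical value is sent to."""
    if data_val < 0:
        # Outside the accepted range: follow the missing-value direction.
        return node.left if node.missing_go_to_left else node.right
    if int(data_val) in node.left_categories:
        # Assumption: membership in the bitset means "go left".
        return node.left
    return node.right


node = Node(left=1, right=2, missing_go_to_left=True,
            left_categories=frozenset({0, 2}))
routes = [route_categorical(node, v) for v in (0.0, 1.0, 2.0, -3.0)]
```

The negative check runs before the membership lookup, so a negative value never reaches the cast to `X_BINNED_DTYPE_C` seen in the diff.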

sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py

Lines changed: 4 additions & 0 deletions
@@ -1176,6 +1176,8 @@ class HistGradientBoostingRegressor(RegressorMixin, BaseHistGradientBoosting):

         For each categorical feature, there must be at most `max_bins` unique
         categories, and each categorical value must be in [0, max_bins -1].
+        During prediction, categories encoded as a negative value are treated as
+        missing values.

         Read more in the :ref:`User Guide <categorical_support_gbdt>`.

@@ -1488,6 +1490,8 @@ class HistGradientBoostingClassifier(ClassifierMixin, BaseHistGradientBoosting):

         For each categorical feature, there must be at most `max_bins` unique
         categories, and each categorical value must be in [0, max_bins -1].
+        During prediction, categories encoded as a negative value are treated as
+        missing values.

         Read more in the :ref:`User Guide <categorical_support_gbdt>`.


sklearn/ensemble/_hist_gradient_boosting/tests/test_gradient_boosting.py

Lines changed: 25 additions & 0 deletions
@@ -1178,3 +1178,28 @@ def test_class_weights():
         clf_balanced.decision_function(X_imb),
         clf_sample_weight.decision_function(X_imb),
     )
+
+
+def test_unknown_category_that_are_negative():
+    """Check that unknown categories that are negative does not error.
+
+    Non-regression test for #24274.
+    """
+    rng = np.random.RandomState(42)
+    n_samples = 1000
+    X = np.c_[rng.rand(n_samples), rng.randint(4, size=n_samples)]
+    y = np.zeros(shape=n_samples)
+    y[X[:, 1] % 2 == 0] = 1
+
+    hist = HistGradientBoostingRegressor(
+        random_state=0,
+        categorical_features=[False, True],
+        max_iter=10,
+    ).fit(X, y)
+
+    # Check that negative values from the second column are treated like a
+    # missing category
+    X_test_neg = np.asarray([[1, -2], [3, -4]])
+    X_test_nan = np.asarray([[1, np.nan], [3, np.nan]])
+
+    assert_allclose(hist.predict(X_test_neg), hist.predict(X_test_nan))
