ENH Adds pandas IntegerArray support to check_array (#16508) · scikit-learn/scikit-learn@7c24d0a · GitHub

Commit 7c24d0a

thomasjpfan authored and adrinjalali committed

ENH Adds pandas IntegerArray support to check_array (#16508)

1 parent 4ad18e2 · commit 7c24d0a

File tree

9 files changed: +186 −34 lines changed

doc/whats_new/v0.23.rst

Lines changed: 20 additions & 0 deletions
@@ -316,6 +316,10 @@ Changelog
   ``max_value`` and ``min_value``. Array-like inputs allow a different max and min to be specified
   for each feature. :pr:`16403` by :user:`Narendra Mukherjee <narendramukherjee>`.
 
+- |Enhancement| :class:`impute.SimpleImputer`, :class:`impute.KNNImputer`, and
+  :class:`impute.IterativeImputer` accept pandas' nullable integer dtype with
+  missing values. :pr:`16508` by `Thomas Fan`_.
+
 :mod:`sklearn.inspection`
 .........................
 
@@ -485,6 +489,13 @@ Changelog
   can now contain `None`, where `drop_idx_[i] = None` means that no category
   is dropped for index `i`. :pr:`16585` by :user:`Chiara Marmo <cmarmo>`.
 
+- |Enhancement| :class:`preprocessing.MaxAbsScaler`,
+  :class:`preprocessing.MinMaxScaler`, :class:`preprocessing.StandardScaler`,
+  :class:`preprocessing.PowerTransformer`,
+  :class:`preprocessing.QuantileTransformer`, and
+  :class:`preprocessing.RobustScaler` now support pandas' nullable integer
+  dtype with missing values. :pr:`16508` by `Thomas Fan`_.
+
 - |Efficiency| :class:`preprocessing.OneHotEncoder` is now faster at
   transforming. :pr:`15762` by `Thomas Fan`_.
 
@@ -566,6 +577,15 @@ Changelog
   matrix from a pandas DataFrame that contains only `SparseArray` columns.
   :pr:`16728` by `Thomas Fan`_.
 
+- |Enhancement| :func:`utils.validation.check_array` supports pandas'
+  nullable integer dtype with missing values when `force_all_finite` is set to
+  `False` or `'allow-nan'`, in which case the data is converted to floating
+  point values where `pd.NA` values are replaced by `np.nan`. As a consequence,
+  all :mod:`sklearn.preprocessing` transformers that accept numeric inputs with
+  missing values represented as `np.nan` now also accept being directly fed
+  pandas dataframes with `pd.Int*` or `pd.UInt*` typed columns that use `pd.NA`
+  as a missing value marker. :pr:`16508` by `Thomas Fan`_.
+
 - |API| Passing classes to :func:`utils.estimator_checks.check_estimator` and
   :func:`utils.estimator_checks.parametrize_with_checks` is now deprecated,
   and support for classes will be removed in 0.24. Pass instances instead.
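
To make the changelog entry concrete, here is a minimal sketch (an illustration by the editor, not part of the commit) of how the new `check_array` behaviour can be exercised, assuming pandas >= 1.0 is installed:

import numpy as np
import pandas as pd
from sklearn.utils import check_array

# Nullable integer columns using pd.NA as the missing value marker
X = pd.DataFrame({"a": pd.array([1, 2, None], dtype="Int64"),
                  "b": pd.array([4, None, 6], dtype="Int64")})

# With force_all_finite='allow-nan' the data is converted to floats and
# pd.NA becomes np.nan; with the default force_all_finite=True it raises.
X_checked = check_array(X, force_all_finite="allow-nan")
print(X_checked.dtype)            # float64
print(np.isnan(X_checked).sum())  # 2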

sklearn/impute/_base.py

Lines changed: 6 additions & 3 deletions
@@ -128,7 +128,9 @@ class SimpleImputer(_BaseImputer):
     ----------
     missing_values : number, string, np.nan (default) or None
         The placeholder for the missing values. All occurrences of
-        `missing_values` will be imputed.
+        `missing_values` will be imputed. For pandas' dataframes with
+        nullable integer dtypes with missing values, `missing_values`
+        should be set to `np.nan`, since `pd.NA` will be converted to `np.nan`.
 
     strategy : string, default='mean'
         The imputation strategy.
 
@@ -476,8 +478,9 @@ class MissingIndicator(TransformerMixin, BaseEstimator):
     ----------
     missing_values : number, string, np.nan (default) or None
         The placeholder for the missing values. All occurrences of
-        `missing_values` will be indicated (True in the output array), the
-        other values will be marked as False.
+        `missing_values` will be imputed. For pandas' dataframes with
+        nullable integer dtypes with missing values, `missing_values`
+        should be set to `np.nan`, since `pd.NA` will be converted to `np.nan`.
 
     features : str, default=None
         Whether the imputer mask should represent all or a subset of
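
As a hedged illustration of the updated :class:`SimpleImputer` docstring (the data below is made up for the example), leaving `missing_values` at its `np.nan` default is all that is needed for a nullable-integer column:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# A nullable integer column; pd.NA marks the missing entry
X = pd.DataFrame({"a": pd.array([1, None, 3], dtype="Int64")})

# missing_values stays at np.nan: pd.NA is converted to np.nan by check_array
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
print(imputer.fit_transform(X))  # [[1.], [2.], [3.]]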

sklearn/impute/_iterative.py

Lines changed: 3 additions & 1 deletion
@@ -54,7 +54,9 @@ class IterativeImputer(_BaseImputer):
 
     missing_values : int, np.nan, default=np.nan
         The placeholder for the missing values. All occurrences of
-        ``missing_values`` will be imputed.
+        `missing_values` will be imputed. For pandas' dataframes with
+        nullable integer dtypes with missing values, `missing_values`
+        should be set to `np.nan`, since `pd.NA` will be converted to `np.nan`.
 
     sample_posterior : boolean, default=False
         Whether to sample from the (Gaussian) predictive posterior of the
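
A small usage sketch for :class:`IterativeImputer` under the same convention (hypothetical data, not from the commit; note the estimator is still experimental, so the enabling import is required):

import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = pd.DataFrame({"a": pd.array([1, 2, None, 4], dtype="Int32"),
                  "b": pd.array([10, 20, 30, None], dtype="Int32")})

# missing_values defaults to np.nan, which also covers pd.NA after conversion
imputer = IterativeImputer(random_state=0)
print(imputer.fit_transform(X))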

sklearn/impute/_knn.py

Lines changed: 3 additions & 1 deletion
@@ -32,7 +32,9 @@ class KNNImputer(_BaseImputer):
     ----------
     missing_values : number, string, np.nan or None, default=`np.nan`
         The placeholder for the missing values. All occurrences of
-        `missing_values` will be imputed.
+        `missing_values` will be imputed. For pandas' dataframes with
+        nullable integer dtypes with missing values, `missing_values`
+        should be set to `np.nan`, since `pd.NA` will be converted to `np.nan`.
 
     n_neighbors : int, default=5
         Number of neighboring samples to use for imputation.
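
And the analogous sketch for :class:`KNNImputer` (again illustrative data supplied by the editor, not part of the commit):

import pandas as pd
from sklearn.impute import KNNImputer

X = pd.DataFrame({"a": pd.array([1, 2, None, 4], dtype="UInt8"),
                  "b": pd.array([2, 4, 6, 8], dtype="UInt8")})

# The missing value in "a" is filled from its 2 nearest neighbours, after
# pd.NA has been converted to np.nan
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))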

sklearn/impute/tests/test_common.py

Lines changed: 29 additions & 0 deletions
@@ -84,3 +84,32 @@ def test_imputers_add_indicator_sparse(imputer, marker):
     imputer.set_params(add_indicator=False)
     X_trans_no_indicator = imputer.fit_transform(X)
     assert_allclose_dense_sparse(X_trans[:, :-4], X_trans_no_indicator)
+
+
+# ConvergenceWarning will be raised by the IterativeImputer
+@pytest.mark.filterwarnings("ignore::sklearn.exceptions.ConvergenceWarning")
+@pytest.mark.parametrize("imputer", IMPUTERS)
+@pytest.mark.parametrize("add_indicator", [True, False])
+def test_imputers_pandas_na_integer_array_support(imputer, add_indicator):
+    # Test pandas IntegerArray with pd.NA
+    pd = pytest.importorskip('pandas', minversion="1.0")
+    marker = np.nan
+    imputer = imputer.set_params(add_indicator=add_indicator,
+                                 missing_values=marker)
+
+    X = np.array([
+        [marker, 1, 5, marker, 1],
+        [2, marker, 1, marker, 2],
+        [6, 3, marker, marker, 3],
+        [1, 2, 9, marker, 4]
+    ])
+    # fit on numpy array
+    X_trans_expected = imputer.fit_transform(X)
+
+    # Creates dataframe with IntegerArrays with pd.NA
+    X_df = pd.DataFrame(X, dtype="Int16", columns=["a", "b", "c", "d", "e"])
+
+    # fit on pandas dataframe with IntegerArrays
+    X_trans = imputer.fit_transform(X_df)
+
+    assert_allclose(X_trans_expected, X_trans)

sklearn/metrics/pairwise.py

Lines changed: 15 additions & 8 deletions
@@ -100,17 +100,20 @@ def check_pairwise_arrays(X, Y, *, precomputed=False, dtype=None,
         raise an error.
 
     force_all_finite : boolean or 'allow-nan', (default=True)
-        Whether to raise an error on np.inf and np.nan in array. The
+        Whether to raise an error on np.inf, np.nan, pd.NA in array. The
         possibilities are:
 
         - True: Force all values of array to be finite.
-        - False: accept both np.inf and np.nan in array.
-        - 'allow-nan': accept only np.nan values in array. Values cannot
-          be infinite.
+        - False: accepts np.inf, np.nan, pd.NA in array.
+        - 'allow-nan': accepts only np.nan and pd.NA values in array. Values
+          cannot be infinite.
 
         .. versionadded:: 0.22
           ``force_all_finite`` accepts the string ``'allow-nan'``.
 
+        .. versionchanged:: 0.23
+           Accepts `pd.NA` and converts it into `np.nan`
+
     copy : bool
         Whether a forced copy will be triggered. If copy=False, a copy might
         be triggered by a conversion.
 
@@ -1691,15 +1694,19 @@ def pairwise_distances(X, Y=None, metric="euclidean", *, n_jobs=None,
         for more details.
 
     force_all_finite : boolean or 'allow-nan', (default=True)
-        Whether to raise an error on np.inf and np.nan in array. The
+        Whether to raise an error on np.inf, np.nan, pd.NA in array. The
        possibilities are:
 
         - True: Force all values of array to be finite.
-        - False: accept both np.inf and np.nan in array.
-        - 'allow-nan': accept only np.nan values in array. Values cannot
-          be infinite.
+        - False: accepts np.inf, np.nan, pd.NA in array.
+        - 'allow-nan': accepts only np.nan and pd.NA values in array. Values
+          cannot be infinite.
 
         .. versionadded:: 0.22
+           ``force_all_finite`` accepts the string ``'allow-nan'``.
+
+        .. versionchanged:: 0.23
+           Accepts `pd.NA` and converts it into `np.nan`
 
     **kwds : optional keyword parameters
         Any further parameters are passed directly to the distance function.
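
For instance (an illustrative sketch by the editor, not part of the commit), a NaN-aware metric combined with `force_all_finite='allow-nan'` lets nullable-integer frames flow through `pairwise_distances`:

import pandas as pd
from sklearn.metrics import pairwise_distances

X = pd.DataFrame({"a": pd.array([1, None, 3], dtype="Int64"),
                  "b": pd.array([4, 5, None], dtype="Int64")})

# pd.NA is converted to np.nan; 'nan_euclidean' then ignores the missing
# coordinates when computing the distances
D = pairwise_distances(X, metric="nan_euclidean", force_all_finite="allow-nan")
print(D.shape)  # (3, 3)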

sklearn/preprocessing/tests/test_common.py

Lines changed: 30 additions & 0 deletions
@@ -126,3 +126,33 @@ def test_missing_value_handling(est, func, support_sparse, strictly_positive):
             Xt_inv_sp = est_sparse.inverse_transform(Xt_sp)
         assert len(records) == 0
         assert_allclose(Xt_inv_sp.A, Xt_inv_dense)
+
+
+@pytest.mark.parametrize(
+    "est, func",
+    [(MaxAbsScaler(), maxabs_scale),
+     (MinMaxScaler(), minmax_scale),
+     (StandardScaler(), scale),
+     (StandardScaler(with_mean=False), scale),
+     (PowerTransformer('yeo-johnson'), power_transform),
+     (PowerTransformer('box-cox'), power_transform,),
+     (QuantileTransformer(n_quantiles=3), quantile_transform),
+     (RobustScaler(), robust_scale),
+     (RobustScaler(with_centering=False), robust_scale)]
+)
+def test_missing_value_pandas_na_support(est, func):
+    # Test pandas IntegerArray with pd.NA
+    pd = pytest.importorskip('pandas', minversion="1.0")
+
+    X = np.array([[1, 2, 3, np.nan, np.nan, 4, 5, 1],
+                  [np.nan, np.nan, 8, 4, 6, np.nan, np.nan, 8],
+                  [1, 2, 3, 4, 5, 6, 7, 8]]).T
+
+    # Creates dataframe with IntegerArrays with pd.NA
+    X_df = pd.DataFrame(X, dtype="Int16", columns=['a', 'b', 'c'])
+    X_df['c'] = X_df['c'].astype('int')
+
+    X_trans = est.fit_transform(X)
+    X_df_trans = est.fit_transform(X_df)
+
+    assert_allclose(X_trans, X_df_trans)

sklearn/utils/tests/test_validation.py

Lines changed: 31 additions & 0 deletions
@@ -349,6 +349,37 @@ def test_check_array():
         check_array(X, dtype="numeric")
 
 
+@pytest.mark.parametrize("pd_dtype", ["Int8", "Int16", "UInt8", "UInt16"])
+@pytest.mark.parametrize("dtype, expected_dtype", [
+    ([np.float32, np.float64], np.float32),
+    (np.float64, np.float64),
+    ("numeric", np.float64),
+])
+def test_check_array_pandas_na_support(pd_dtype, dtype, expected_dtype):
+    # Test pandas IntegerArray with pd.NA
+    pd = pytest.importorskip('pandas', minversion="1.0")
+
+    X_np = np.array([[1, 2, 3, np.nan, np.nan],
+                     [np.nan, np.nan, 8, 4, 6],
+                     [1, 2, 3, 4, 5]]).T
+
+    # Creates dataframe with IntegerArrays with pd.NA
+    X = pd.DataFrame(X_np, dtype=pd_dtype, columns=['a', 'b', 'c'])
+    # column c has no nans
+    X['c'] = X['c'].astype('float')
+    X_checked = check_array(X, force_all_finite='allow-nan', dtype=dtype)
+    assert_allclose(X_checked, X_np)
+    assert X_checked.dtype == expected_dtype
+
+    X_checked = check_array(X, force_all_finite=False, dtype=dtype)
+    assert_allclose(X_checked, X_np)
+    assert X_checked.dtype == expected_dtype
+
+    msg = "Input contains NaN, infinity"
+    with pytest.raises(ValueError, match=msg):
+        check_array(X, force_all_finite=True)
+
+
 def test_check_array_pandas_dtype_object_conversion():
     # test that data-frame like objects with dtype object
     # get converted

sklearn/utils/validation.py

Lines changed: 49 additions & 21 deletions
@@ -135,17 +135,20 @@ def as_float_array(X, *, copy=True, force_all_finite=True):
         returned if X's dtype is not a floating point type.
 
     force_all_finite : boolean or 'allow-nan', (default=True)
-        Whether to raise an error on np.inf and np.nan in X. The possibilities
-        are:
+        Whether to raise an error on np.inf, np.nan, pd.NA in X. The
+        possibilities are:
 
         - True: Force all values of X to be finite.
-        - False: accept both np.inf and np.nan in X.
-        - 'allow-nan': accept only np.nan values in X. Values cannot be
-          infinite.
+        - False: accepts np.inf, np.nan, pd.NA in X.
+        - 'allow-nan': accepts only np.nan and pd.NA values in X. Values cannot
+          be infinite.
 
         .. versionadded:: 0.20
           ``force_all_finite`` accepts the string ``'allow-nan'``.
 
+        .. versionchanged:: 0.23
+           Accepts `pd.NA` and converts it into `np.nan`
+
     Returns
     -------
     XT : {array, sparse matrix}
 
@@ -317,17 +320,20 @@ def _ensure_sparse_format(spmatrix, accept_sparse, dtype, copy,
         be triggered by a conversion.
 
     force_all_finite : boolean or 'allow-nan', (default=True)
-        Whether to raise an error on np.inf and np.nan in X. The possibilities
-        are:
+        Whether to raise an error on np.inf, np.nan, pd.NA in X. The
+        possibilities are:
 
         - True: Force all values of X to be finite.
-        - False: accept both np.inf and np.nan in X.
-        - 'allow-nan': accept only np.nan values in X. Values cannot be
-          infinite.
+        - False: accepts np.inf, np.nan, pd.NA in X.
+        - 'allow-nan': accepts only np.nan and pd.NA values in X. Values cannot
+          be infinite.
 
         .. versionadded:: 0.20
           ``force_all_finite`` accepts the string ``'allow-nan'``.
 
+        .. versionchanged:: 0.23
+           Accepts `pd.NA` and converts it into `np.nan`
+
     Returns
     -------
     spmatrix_converted : scipy sparse matrix.
 
@@ -438,19 +444,20 @@ def check_array(array, accept_sparse=False, *, accept_large_sparse=True,
         be triggered by a conversion.
 
     force_all_finite : boolean or 'allow-nan', (default=True)
-        Whether to raise an error on np.inf and np.nan in array. The
+        Whether to raise an error on np.inf, np.nan, pd.NA in array. The
         possibilities are:
 
         - True: Force all values of array to be finite.
-        - False: accept both np.inf and np.nan in array.
-        - 'allow-nan': accept only np.nan values in array. Values cannot
-          be infinite.
-
-        For object dtyped data, only np.nan is checked and not np.inf.
+        - False: accepts np.inf, np.nan, pd.NA in array.
+        - 'allow-nan': accepts only np.nan and pd.NA values in array. Values
+          cannot be infinite.
 
         .. versionadded:: 0.20
           ``force_all_finite`` accepts the string ``'allow-nan'``.
 
+        .. versionchanged:: 0.23
+           Accepts `pd.NA` and converts it into `np.nan`
+
     ensure_2d : boolean (default=True)
         Whether to raise a value error if array is not 2D.
 
@@ -491,6 +498,7 @@ def check_array(array, accept_sparse=False, *, accept_large_sparse=True,
     # check if the object contains several dtypes (typically a pandas
     # DataFrame), and store them. If not, store None.
     dtypes_orig = None
+    has_pd_integer_array = False
     if hasattr(array, "dtypes") and hasattr(array.dtypes, '__array__'):
         # throw warning if columns are sparse. If all columns are sparse, then
         # array.sparse exists and sparsity will be perserved (later).
 
@@ -508,6 +516,19 @@ def check_array(array, accept_sparse=False, *, accept_large_sparse=True,
         for i, dtype_iter in enumerate(dtypes_orig):
             if dtype_iter.kind == 'b':
                 dtypes_orig[i] = np.dtype(np.object)
+            elif dtype_iter.name.startswith(("Int", "UInt")):
+                # name looks like an Integer Extension Array, now check for
+                # the dtype
+                with suppress(ImportError):
+                    from pandas import (Int8Dtype, Int16Dtype,
+                                        Int32Dtype, Int64Dtype,
+                                        UInt8Dtype, UInt16Dtype,
+                                        UInt32Dtype, UInt64Dtype)
+                    if isinstance(dtype_iter, (Int8Dtype, Int16Dtype,
+                                               Int32Dtype, Int64Dtype,
+                                               UInt8Dtype, UInt16Dtype,
+                                               UInt32Dtype, UInt64Dtype)):
+                        has_pd_integer_array = True
 
         if all(isinstance(dtype, np.dtype) for dtype in dtypes_orig):
             dtype_orig = np.result_type(*dtypes_orig)
 
@@ -528,6 +549,10 @@ def check_array(array, accept_sparse=False, *, accept_large_sparse=True,
             # list of accepted types.
             dtype = dtype[0]
 
+    if has_pd_integer_array:
+        # If there are any pandas integer extension arrays, convert the
+        # whole container up front so that pd.NA values become np.nan
+        array = array.astype(dtype)
+
     if force_all_finite not in (True, False, 'allow-nan'):
         raise ValueError('force_all_finite should be a bool or "allow-nan"'
                          '. Got {!r} instead'.format(force_all_finite))
 
@@ -712,18 +737,21 @@ def check_X_y(X, y, accept_sparse=False, *, accept_large_sparse=True,
         be triggered by a conversion.
 
     force_all_finite : boolean or 'allow-nan', (default=True)
-        Whether to raise an error on np.inf and np.nan in X. This parameter
-        does not influence whether y can have np.inf or np.nan values.
+        Whether to raise an error on np.inf, np.nan, pd.NA in X. This parameter
+        does not influence whether y can have np.inf, np.nan, pd.NA values.
         The possibilities are:
 
         - True: Force all values of X to be finite.
-        - False: accept both np.inf and np.nan in X.
-        - 'allow-nan': accept only np.nan values in X. Values cannot be
-          infinite.
+        - False: accepts np.inf, np.nan, pd.NA in X.
+        - 'allow-nan': accepts only np.nan or pd.NA values in X. Values cannot
+          be infinite.
 
         .. versionadded:: 0.20
           ``force_all_finite`` accepts the string ``'allow-nan'``.
 
+        .. versionchanged:: 0.23
+           Accepts `pd.NA` and converts it into `np.nan`
+
     ensure_2d : boolean (default=True)
         Whether to raise a value error if X is not 2D.
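
The detection logic above boils down to recognising the pandas nullable-integer extension dtypes and casting the frame before the finiteness checks run. Below is a standalone sketch of the same idea (the helper name is hypothetical and not part of scikit-learn's API), assuming pandas >= 1.0:

import numpy as np
import pandas as pd

_NULLABLE_INT_DTYPES = (pd.Int8Dtype, pd.Int16Dtype, pd.Int32Dtype,
                        pd.Int64Dtype, pd.UInt8Dtype, pd.UInt16Dtype,
                        pd.UInt32Dtype, pd.UInt64Dtype)

def _to_float_ndarray(df, dtype=np.float64):
    # Hypothetical helper: if any column uses a nullable integer dtype,
    # cast the whole frame so pd.NA becomes np.nan in the float output.
    if any(isinstance(dt, _NULLABLE_INT_DTYPES) for dt in df.dtypes):
        df = df.astype(dtype)
    return np.asarray(df, dtype=dtype)

X = pd.DataFrame({"a": pd.array([1, None], dtype="Int8")})
print(_to_float_ndarray(X))  # [[ 1.] [nan]]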

0 commit comments