Description
Describe the bug
In check_array, dtype_orig is determined for pandas DataFrames by checking all(isinstance(dtype_iter, np.dtype) for dtype_iter in dtypes_orig). The pandas nullable extension dtypes, such as boolean, Int64, and Float64, are not np.dtype instances, so this check fails and dtype_orig is left as None.
If pandas_requires_conversion is true, check_array then calls array = array.astype(None), which pandas interprets as a request to convert to float64. If array contains data that is neither numeric nor boolean, this raises ValueError: could not convert string to float: when the data has the object dtype with string data, or ValueError: Cannot cast object dtype to float64 when the data has the category dtype with object categories.
I first found this while using the imblearn SMOTEN and SMOTENC oversamplers, but it could be triggered by any other use of check_array.
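The dtype check described above can be observed in isolation. The following is a minimal sketch, not the scikit-learn source: it reproduces only the isinstance test quoted in this report, and the helper names all_numpy_dtypes and make_frame are hypothetical.

```python
import numpy as np
import pandas as pd


def all_numpy_dtypes(frame):
    """Hypothetical helper mirroring the isinstance check quoted above."""
    dtypes_orig = list(frame.dtypes)
    return all(isinstance(dtype_iter, np.dtype) for dtype_iter in dtypes_orig)


def make_frame(dtype):
    # One column of the dtype under test plus an object column of strings,
    # as in the reproduction data used in this report.
    return pd.DataFrame(
        {
            "a": pd.Series([1, 0, 1, 0], dtype=dtype),
            "b": pd.Series(["a", "b", "c", "d"], dtype="object"),
        }
    )


results = {
    dtype: all_numpy_dtypes(make_frame(dtype))
    for dtype in ["bool", "int64", "float64", "boolean", "Int64", "Float64"]
}
print(results)
```

The three NumPy-backed dtypes should report True and the three nullable extension dtypes False, which is the condition under which dtype_orig stays None.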
Steps/Code to Reproduce
Reproduction via oversamplers
import pandas as pd
import pytest
from imblearn import over_sampling as im

for dtype in ["boolean", "Int64", "Float64"]:
    X = pd.DataFrame(
        {
            "a": pd.Series([1, 0, 1, 0], dtype=dtype),
            "b": pd.Series(["a", "b", "c", "d"], dtype="object"),
            "c": pd.Series([9, 8, 7, 6], dtype="int64"),
        }
    )
    y = pd.Series([0, 1, 1, 0], dtype="int64")
    for oversampler in [im.SMOTENC(categorical_features=[0, 1]), im.SMOTEN()]:
        # Each nullable dtype currently triggers the ValueError.
        with pytest.raises(ValueError):
            oversampler.fit_resample(X, y)

Reproduction via check_array directly
import pandas as pd
import pytest
from sklearn.utils.validation import check_array

for dtype in ["boolean", "Int64", "Float64"]:
    X = pd.DataFrame(
        {
            "a": pd.Series([1, 0, 1, 0], dtype=dtype),
            "b": pd.Series(["a", "b", "c", "d"], dtype="object"),
            "c": pd.Series([9, 8, 7, 6], dtype="int64"),
        }
    )
    # Each nullable dtype currently triggers the ValueError.
    with pytest.raises(ValueError):
        check_array(X, dtype=None)

Expected Results
We should get the same behavior as with the non-nullable equivalents ["bool", "int64", "float64"], which raise no error.
import pandas as pd
from sklearn.utils.validation import check_array

for dtype in ["bool", "int64", "float64"]:
    X = pd.DataFrame(
        {
            "a": pd.Series([1, 0, 1, 0], dtype=dtype),
            "b": pd.Series(["a", "b", "c", "d"], dtype="object"),
            "c": pd.Series([9, 8, 7, 6], dtype="int64"),
        }
    )
    check_array(X, dtype=None)

Actual Results
The actual result is a ValueError: could not convert string to float: when the data has the object dtype with string data, or a ValueError: Cannot cast object dtype to float64 when the data has the category dtype with object categories.
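One possible interim workaround is to cast nullable columns to their NumPy equivalents before validation. This is only a sketch under stated assumptions, not a fix proposed for check_array itself: the helper name to_numpy_dtypes and the NULLABLE_TO_NUMPY mapping are hypothetical, and columns containing missing values are deliberately left untouched because pd.NA has no representation in bool or int64.

```python
import pandas as pd

# Assumed mapping from nullable extension dtype names to NumPy dtype names.
NULLABLE_TO_NUMPY = {"boolean": "bool", "Int64": "int64", "Float64": "float64"}


def to_numpy_dtypes(frame):
    """Hypothetical helper: cast nullable columns that have no missing values."""
    casts = {
        col: NULLABLE_TO_NUMPY[str(dtype)]
        for col, dtype in frame.dtypes.items()
        if str(dtype) in NULLABLE_TO_NUMPY and not frame[col].isna().any()
    }
    # astype with a column-to-dtype mapping only touches the listed columns.
    return frame.astype(casts)


X = pd.DataFrame(
    {
        "a": pd.Series([1, 0, 1, 0], dtype="Int64"),
        "b": pd.Series(["a", "b", "c", "d"], dtype="object"),
        "c": pd.Series([1, None, 3, 4], dtype="Int64"),  # has a missing value
    }
)
X_converted = to_numpy_dtypes(X)
print(X_converted.dtypes)
```

Column "a" becomes int64, while "b" and "c" keep their dtypes. A frame whose nullable columns were all converted this way has the same dtypes as the non-nullable example under Expected Results, which this report states passes check_array(X, dtype=None).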
Versions
System:
python: 3.8.2 (default, May 21 2021, 12:12:59) [Clang 11.0.3 (clang-1103.0.32.62)]
executable: /Users/tamar.grey/.pyenv/versions/3.8.2/envs/evalml-dev/bin/python
machine: macOS-10.16-x86_64-i386-64bit
Python dependencies:
sklearn: 1.2.2
pip: 22.2.2
setuptools: 59.8.0
numpy: 1.22.4
scipy: 1.8.1
Cython: 0.29.32
pandas: 1.5.3
matplotlib: 3.5.3
joblib: 1.2.0
threadpoolctl: 3.1.0
Built with OpenMP: True