Describe the bug
The column dtype changes from category to object when I transform it using SimpleImputer.
Here is a list of related Issues and PRs that I found while trying to solve this problem:
#29381
#18860
#17625
#17526
#17525
If this is truly a bug, I would like to work on a fix.
Steps/Code to Reproduce
import pandas as pd
from sklearn.impute import SimpleImputer
df = pd.DataFrame(data=['A', 'B', 'C', 'A', pd.NA], columns=['column_1'], dtype='category')
df.info()
imputer = SimpleImputer(missing_values=pd.NA, strategy="most_frequent").set_output(transform='pandas')
output = imputer.fit_transform(df)
output.info()
Expected Results
This is the output I expected to see on the terminal
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 column_1 4 non-null category
dtypes: category(1)
memory usage: 269.0 bytes
>>> output.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 column_1 5 non-null category
dtypes: category(1)
memory usage: 172.0+ bytes
I expected output to keep the same dtype as the original pd.DataFrame.
Actual Results
The actual output of output.info() is:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 column_1 5 non-null object
dtypes: object(1)
memory usage: 172.0+ bytes
Observe that the Dtype for column_1 is now object instead of category.
Versions
System:
python: 3.12.3 | packaged by Anaconda, Inc. | (main, May 6 2024, 19:46:43) [GCC 11.2.0]
executable: /home/user/miniconda3/envs/prod/bin/python
machine: Linux-6.8.0-59-lowlatency-x86_64-with-glibc2.39
Python dependencies:
sklearn: 1.5.2
pip: 25.0
setuptools: 75.8.0
numpy: 2.1.1
scipy: 1.14.1
Cython: None
pandas: 2.2.3
matplotlib: 3.9.2
joblib: 1.4.2
threadpoolctl: 3.5.0
Built with OpenMP: True
threadpoolctl info:
user_api: blas
internal_api: openblas
num_threads: 16
prefix: libscipy_openblas
filepath: /home/user/miniconda3/envs/prod/lib/python3.12/site-packages/numpy.libs/libscipy_openblas64_-ff651d7f.so
version: 0.3.27
threading_layer: pthreads
architecture: SkylakeX
user_api: blas
internal_api: openblas
num_threads: 16
prefix: libscipy_openblas
filepath: /home/user/miniconda3/envs/prod/lib/python3.12/site-packages/scipy.libs/libscipy_openblas-c128ec02.so
version: 0.3.27.dev
threading_layer: pthreads
architecture: SkylakeX
user_api: openmp
internal_api: openmp
num_threads: 16
prefix: libgomp
filepath: /home/user/miniconda3/envs/prod/lib/python3.12/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
version: None