ENH Adds categories with missing values support to fetch_openml with as_frame=True by amy12xx · Pull Request #19365 · scikit-learn/scikit-learn · GitHub

ENH Adds categories with missing values support to fetch_openml with as_frame=True #19365


Merged 10 commits on Feb 6, 2021
4 changes: 0 additions & 4 deletions doc/whats_new/v0.24.rst
@@ -234,10 +234,6 @@ Changelog
   files downloaded or cached to ensure data integrity.
   :pr:`14800` by :user:`Shashank Singh <shashanksingh28>` and `Joel Nothman`_.

-- |Feature| :func:`datasets.fetch_openml` now validates md5checksum of arff
-  files downloaded or cached to ensure data integrity.
-  :pr:`14800` by :user:`Shashank Singh <shashanksingh28>` and `Joel Nothman`_.
-
 - |Enhancement| :func:`datasets.fetch_openml` now allows argument `as_frame`
   to be 'auto', which tries to convert returned data to pandas DataFrame
   unless data is sparse.
8 changes: 8 additions & 0 deletions doc/whats_new/v1.0.rst
@@ -172,6 +172,14 @@ Changelog
   :class:`~sklearn.semi_supervised.LabelPropagation`.
   :pr:`19271` by :user:`Zhaowei Wang <ThuWangzw>`.

+:mod:`sklearn.datasets`
+.......................
+
+- |Enhancement| :func:`datasets.fetch_openml` now supports categories with
+  missing values when returning a pandas dataframe. :pr:`19365` by
+  `Thomas Fan`_ and :user:`Amanda Dsouza <amy12xx>` and
+  :user:`EL-ATEIF Sara <elateifsara>`.
+
 Code and Documentation Contributors
 -----------------------------------

6 changes: 5 additions & 1 deletion sklearn/datasets/_openml.py
@@ -23,6 +23,7 @@
 from . import get_data_home
 from urllib.error import HTTPError
 from ..utils import Bunch
+from ..utils import is_scalar_nan
 from ..utils import get_chunk_n_rows
 from ..utils import _chunk_generator
 from ..utils import check_pandas_support  # noqa
@@ -357,7 +358,10 @@ def _convert_arff_data_dataframe(
     for column in columns_to_keep:
         dtype = _feature_to_dtype(features_dict[column])
         if dtype == 'category':
-            dtype = pd.api.types.CategoricalDtype(attributes[column])
+            cats_without_missing = [cat for cat in attributes[column]
+                                    if cat is not None and
+                                    not is_scalar_nan(cat)]
+            dtype = pd.api.types.CategoricalDtype(cats_without_missing)
         df[column] = df[column].astype(dtype, copy=False)
     return (df, )

4 binary files not shown.
15 changes: 15 additions & 0 deletions sklearn/datasets/tests/test_openml.py
@@ -1311,3 +1311,18 @@ def test_convert_arff_data_type():
     msg = r"arff\['data'\] must be a generator when converting to pd.DataFrame"
     with pytest.raises(ValueError, match=msg):
         _convert_arff_data_dataframe(arff, ['a'], {})
+
+
+def test_missing_values_pandas(monkeypatch):
+    """check that missing values in categories are compatible with pandas
+    categorical"""
+    pytest.importorskip('pandas')
+
+    data_id = 42585
+    _monkey_patch_webbased_functions(monkeypatch, data_id, True)
+    penguins = fetch_openml(data_id=data_id, cache=False, as_frame=True)
+
+    cat_dtype = penguins.data.dtypes['sex']
+    # there are nans in the categorical
+    assert penguins.data['sex'].isna().any()
+    assert_array_equal(cat_dtype.categories, ['FEMALE', 'MALE', '_'])
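
The behaviour this test checks can be reproduced without the OpenML fixtures. The sketch below uses a small in-memory column; the values and the helper name `to_categorical` are made up for illustration, and `pd.isna` stands in for the `is_scalar_nan` check used in the PR. It assumes pandas is installed.

```python
import pandas as pd


def to_categorical(values, declared_categories):
    """Cast values to a categorical dtype, ignoring null categories.

    Hypothetical helper: filters None/NaN out of the declared categories
    before building the dtype, as the PR does in
    _convert_arff_data_dataframe.
    """
    cats = [cat for cat in declared_categories if not pd.isna(cat)]
    dtype = pd.api.types.CategoricalDtype(cats)
    return pd.Series(values).astype(dtype)


# Made-up data: one missing entry in the values and one in the categories.
values = ["MALE", None, "FEMALE", "_"]
declared = ["FEMALE", "MALE", "_", None]

# Building the dtype from the unfiltered list fails, because pandas
# does not accept null categories.
try:
    pd.api.types.CategoricalDtype(declared)
    unfiltered_ok = True
except ValueError:
    unfiltered_ok = False

sex = to_categorical(values, declared)
```

After the cast, the missing entry is still reported by `isna()`, while the dtype's categories contain only the non-null labels, which is the pair of conditions the test asserts on the `sex` column.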