Commit 80c47b0

amy12xx and thomasjpfan authored
ENH Adds categories with missing values support to fetch_openml with as_frame=True (#19365)
* MNT Fixes missing value loading into a dataframe from openml
* FIX Include data files
* DOC Adds whats new
* REV Less diffs
* DOC Uses docstring style comment in test
* resolved conflict
* resolve conflict in doc

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
1 parent a952fbb commit 80c47b0

8 files changed: 28 additions and 5 deletions

doc/whats_new/v0.24.rst

Lines changed: 0 additions & 4 deletions
@@ -234,10 +234,6 @@ Changelog
   files downloaded or cached to ensure data integrity.
   :pr:`14800` by :user:`Shashank Singh <shashanksingh28>` and `Joel Nothman`_.
 
-- |Feature| :func:`datasets.fetch_openml` now validates md5checksum of arff
-  files downloaded or cached to ensure data integrity.
-  :pr:`14800` by :user:`Shashank Singh <shashanksingh28>` and `Joel Nothman`_.
-
 - |Enhancement| :func:`datasets.fetch_openml` now allows argument `as_frame`
   to be 'auto', which tries to convert returned data to pandas DataFrame
   unless data is sparse.

doc/whats_new/v1.0.rst

Lines changed: 8 additions & 0 deletions
@@ -172,6 +172,14 @@ Changelog
   :class:`~sklearn.semi_supervised.LabelPropagation`.
   :pr:`19271` by :user:`Zhaowei Wang <ThuWangzw>`.
 
+:mod:`sklearn.datasets`
+.......................
+
+- |Enhancement| :func:`datasets.fetch_openml` now supports categories with
+  missing values when returning a pandas dataframe. :pr:`19365` by
+  `Thomas Fan`_ and :user:`Amanda Dsouza <amy12xx>` and
+  :user:`EL-ATEIF Sara <elateifsara>`.
+
 Code and Documentation Contributors
 -----------------------------------

sklearn/datasets/_openml.py

Lines changed: 5 additions & 1 deletion
@@ -23,6 +23,7 @@
 from . import get_data_home
 from urllib.error import HTTPError
 from ..utils import Bunch
+from ..utils import is_scalar_nan
 from ..utils import get_chunk_n_rows
 from ..utils import _chunk_generator
 from ..utils import check_pandas_support  # noqa
@@ -357,7 +358,10 @@ def _convert_arff_data_dataframe(
     for column in columns_to_keep:
         dtype = _feature_to_dtype(features_dict[column])
         if dtype == 'category':
-            dtype = pd.api.types.CategoricalDtype(attributes[column])
+            cats_without_missing = [cat for cat in attributes[column]
+                                    if cat is not None and
+                                    not is_scalar_nan(cat)]
+            dtype = pd.api.types.CategoricalDtype(cats_without_missing)
         df[column] = df[column].astype(dtype, copy=False)
     return (df, )
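
For context, a minimal sketch of why the category list is filtered before the dtype is built: pandas refuses to construct a `CategoricalDtype` whose categories include a null entry. The category list here is made up for illustration, and `is_missing` is a hypothetical stand-in for the `cat is not None and not is_scalar_nan(cat)` check in the diff, so that the snippet runs without scikit-learn.

```python
import math

import pandas as pd


def is_missing(value):
    """Hypothetical stand-in for the None / is_scalar_nan check in the diff."""
    return value is None or (isinstance(value, float) and math.isnan(value))


# Illustrative category list (an assumption, not the real ARFF attributes).
raw_categories = ['FEMALE', 'MALE', '_', None]

# pandas rejects null entries in a category list with a ValueError.
try:
    pd.api.types.CategoricalDtype(raw_categories)
    null_categories_rejected = False
except ValueError:
    null_categories_rejected = True

# Dropping the null entries first yields a valid categorical dtype.
cats_without_missing = [cat for cat in raw_categories if not is_missing(cat)]
dtype = pd.api.types.CategoricalDtype(cats_without_missing)
```

Filtering the list only changes which labels count as categories; missing values in the column itself are still represented as NaN after the cast.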

Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.

sklearn/datasets/tests/test_openml.py

Lines changed: 15 additions & 0 deletions
@@ -1321,3 +1321,18 @@ def test_convert_arff_data_type():
     msg = r"arff\['data'\] must be a generator when converting to pd.DataFrame"
     with pytest.raises(ValueError, match=msg):
         _convert_arff_data_dataframe(arff, ['a'], {})
+
+
+def test_missing_values_pandas(monkeypatch):
+    """check that missing values in categories are compatible with pandas
+    categorical"""
+    pytest.importorskip('pandas')
+
+    data_id = 42585
+    _monkey_patch_webbased_functions(monkeypatch, data_id, True)
+    penguins = fetch_openml(data_id=data_id, cache=False, as_frame=True)
+
+    cat_dtype = penguins.data.dtypes['sex']
+    # there are nans in the categorical
+    assert penguins.data['sex'].isna().any()
+    assert_array_equal(cat_dtype.categories, ['FEMALE', 'MALE', '_'])
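
The two assertions in this test can be reproduced offline on a hand-built column. This is only a sketch: the `sex` values below are invented for illustration and are not taken from OpenML dataset 42585, and no network or mocked download is involved.

```python
import pandas as pd

# Hand-built stand-in for the 'sex' column (illustrative values only).
sex = pd.Series(['MALE', None, 'FEMALE', '_'], dtype=object)

# Categories with the missing entry already filtered out, as in the patch.
dtype = pd.api.types.CategoricalDtype(['FEMALE', 'MALE', '_'])
sex_cat = sex.astype(dtype)

# The missing value survives as NaN, but it is not listed as a category.
has_missing = bool(sex_cat.isna().any())
categories = list(sex_cat.dtype.categories)
```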
