TST Fix openml parser implementation for pandas-dev (#26386) · REDVM/scikit-learn@6d726f3 · GitHub

Commit 6d726f3

thomasjpfan authored and REDVM committed
TST Fix openml parser implementation for pandas-dev (scikit-learn#26386)
1 parent 4955b71 commit 6d726f3

3 files changed: 9 additions, 1 deletion

doc/whats_new/v1.3.rst

Lines changed: 3 additions & 0 deletions
@@ -221,6 +221,9 @@ Changelog
   is deprecated and will be removed in v1.5.
   :pr:`25784` by :user:`Jérémie du Boisberranger`.
 
+- |Fix| :func:`datasets.fetch_openml` returns improved data types when
+  `as_frame=True` and `parser="liac-arff"`. :pr:`26386` by `Thomas Fan`_.
+
 :mod:`sklearn.decomposition`
 ............................
 

sklearn/datasets/_arff_parser.py

Lines changed: 5 additions & 0 deletions
@@ -199,6 +199,11 @@ def _io_to_generator(gzip_file):
         dfs.append(
             pd.DataFrame(data, columns=columns_names, copy=False)[columns_to_keep]
         )
+    # dfs[0] contains only one row, which may not have enough data to infer to
+    # column's dtype. Here we use `dfs[1]` to configure the dtype in dfs[0]
+    if len(dfs) >= 2:
+        dfs[0] = dfs[0].astype(dfs[1].dtypes)
+
     frame = pd.concat(dfs, ignore_index=True)
     del dfs, first_df
 
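The hunk above can be illustrated with a small standalone sketch. It is not the scikit-learn implementation: the column names and values are made up, and the two frames only stand in for `dfs[0]` (a single row) and `dfs[1]` (a larger chunk). It shows how a one-row chunk holding only a missing value gets a less specific dtype, and how casting it to the second chunk's dtypes before `pd.concat` removes the mismatch.

```python
import pandas as pd

# Hypothetical column names, chosen only for illustration.
columns = ["age", "fare"]

# Stand-in for dfs[0]: a single row whose "age" value is missing.
# With nothing but None to look at, pandas infers object dtype for "age".
first = pd.DataFrame([[None, 7.25]], columns=columns)

# Stand-in for dfs[1]: several rows, enough to infer float64 for "age".
second = pd.DataFrame([[22.0, 71.3], [None, 8.05]], columns=columns)

print(first.dtypes["age"])   # object
print(second.dtypes["age"])  # float64

# Same idea as the added lines: cast the first chunk to the second
# chunk's dtypes so both chunks agree before concatenation.
aligned = first.astype(second.dtypes)
frame = pd.concat([aligned, second], ignore_index=True)

print(frame.dtypes["age"])   # float64
```

Casting up front makes the result independent of how `pd.concat` resolves mismatched dtypes, which is the kind of behavior that may differ between pandas releases.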

sklearn/datasets/tests/test_openml.py

Lines changed: 1 addition & 1 deletion
@@ -925,7 +925,7 @@ def datasets_missing_values():
         # with casting it will be transformed to either float or Int64
         (40966, "pandas", 1, 77, 0),
         # titanic
-        (40945, "liac-arff", 3, 5, 0),
+        (40945, "liac-arff", 3, 6, 0),
         (40945, "pandas", 3, 3, 3),
     ],
 )
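The hunk does not show the test's parameter names. Assuming the three trailing integers in each tuple are the expected counts of categorical, float, and integer columns, the change from `5` to `6` would mean one more column is now inferred as float with the `liac-arff` parser. The sketch below shows one way such counts could be computed; `count_dtype_kinds` is a hypothetical helper and the toy frame is invented, neither comes from the scikit-learn test suite.

```python
import pandas as pd


def count_dtype_kinds(frame):
    """Return (n_categories, n_floats, n_ints) for the columns of a DataFrame."""
    dtypes = list(frame.dtypes)
    n_categories = sum(isinstance(dt, pd.CategoricalDtype) for dt in dtypes)
    n_floats = sum(pd.api.types.is_float_dtype(dt) for dt in dtypes)
    n_ints = sum(pd.api.types.is_integer_dtype(dt) for dt in dtypes)
    return n_categories, n_floats, n_ints


# Invented data, loosely titanic-shaped, used only to exercise the helper.
toy = pd.DataFrame(
    {
        "sex": pd.Categorical(["male", "female"]),
        "age": [22.0, None],
        "fare": [7.25, 71.3],
        "sibsp": [1, 0],
    }
)

print(count_dtype_kinds(toy))  # (1, 2, 1)
```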
