MAINT Remove Python<3.9 code from sklearn.utils.fixes #27945
Conversation
Thanks for thinking about checking those. It was on my to-do list, so I'm happy that this is already done :)
with _path(TEST_DATA_MODULE, datafile) as data_path:
    X1, y1 = load_svmlight_file(str(data_path))
    X2, y2 = load_svmlight_file(data_path)
data_path = _svmlight_local_test_file_path(datafile)
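The test above exercises passing both a str and a pathlib.Path to the loader. A minimal standalone sketch of why both work (load_file and the temporary file are illustrative only, not sklearn code):

```python
import os
import tempfile
from pathlib import Path


def load_file(path):
    # Illustrative loader, not the sklearn API: os.fspath() accepts both
    # str and pathlib.Path, so callers can pass either form, which is
    # what the test above checks for load_svmlight_file.
    with open(os.fspath(path), "rb") as f:
        return f.read()


tmp = Path(tempfile.mkdtemp()) / "data.svm"
tmp.write_bytes(b"1 0:1.0 2:0.5\n")
assert load_file(str(tmp)) == load_file(tmp)  # str and Path behave the same
```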
So it seems like the resources.as_file that was previously used in _path makes sense when the resource is inside a .zip, see https://docs.python.org/3.12/library/importlib.resources.html#importlib.resources.as_file. It seems like sklearn/datasets/tests/test_svmlight_format.py is the only place where this matters, so I removed it. I am not sure we care about this much in general, since loading .so files from zips is not allowed according to this.
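To illustrate the distinction being discussed, here is a minimal sketch using only the stdlib (the demo_pkg package and its data file are throwaway names created on the fly; none of this is sklearn code): resources.files() returns a Traversable that can be read directly even from a zipped package, while as_file() is only needed when an API requires a concrete filesystem path.

```python
import importlib
import sys
import tempfile
from importlib import resources
from pathlib import Path

# Build a throwaway package with one data file (illustration only).
tmp = tempfile.mkdtemp()
pkg = Path(tmp) / "demo_pkg"
pkg.mkdir()
(pkg / "__init__.py").write_text("")
(pkg / "data.csv").write_text("a,b\n1,2\n", encoding="utf-8")
sys.path.insert(0, tmp)
importlib.invalidate_caches()

# files() returns a Traversable: reading works even if the package
# ships inside a zip archive.
trav = resources.files("demo_pkg") / "data.csv"
content = trav.read_text(encoding="utf-8")

# as_file() guarantees a real filesystem path, extracting the resource
# to a temporary file only when the package lives inside an archive.
with resources.as_file(trav) as path:
    assert path.is_file()
```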
Indeed, I don't think we care, because it would have failed for all the other loaders. Nobody uses the .egg package format anymore, and both .whl and conda packages get unzipped in the site-packages folder prior to importing.
If we ever have someone complaining, we can add that feature consistently for all dataset loaders in a future PR. That might be useful when using scikit-learn in a frozen executable app for instance.
A series of comments about specifying the encoding, to avoid undefined behavior on systems with varying locale configurations. I did not suggest on all the lines, but the following suggestions should be generalizable to all other calls to open and read_text in this PR.
sklearn/datasets/_base.py (outdated)

@@ -340,7 +340,8 @@ def load_csv_data(
         Description of the dataset (the content of `descr_file_name`).
         Only returned if `descr_file_name` is not None.
     """
-    with _open_text(data_module, data_file_name) as csv_file:
+    data_path = resources.files(data_module) / data_file_name
+    with data_path.open("r") as csv_file:
I would rather never use open in text mode without specifying the expected encoding. The pathlib.Path.open doc refers to the open built-in, whose docstring states: "In text mode, if encoding is not specified the encoding used is platform dependent: locale.getencoding() is called to get the current locale encoding."
Having a dataset loading behavior that depends on the OS configuration of the user is problematic, because our datasets are all encoded with a fixed encoding.
I would be in favor of adding encoding="utf-8" as an additional kwarg to load_csv_data, passing it to the underlying call, and making it possible to pass an alternative encoding for specific datasets that use load_csv_data on a case-by-case basis.
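The suggested kwarg could look roughly like the following sketch (the demo_data package and file are throwaway names created just for the demonstration; the real load_csv_data in sklearn/datasets/_base.py has more parameters and return values):

```python
import csv
import importlib
import sys
import tempfile
from importlib import resources
from pathlib import Path


def load_csv_data(data_module, data_file_name, *, encoding="utf-8"):
    # Hypothetical signature sketch: an explicit encoding keyword means
    # the loader never falls back to the platform locale encoding.
    data_path = resources.files(data_module) / data_file_name
    with data_path.open("r", encoding=encoding) as csv_file:
        return list(csv.reader(csv_file))


# Throwaway package for demonstration purposes only.
tmp = tempfile.mkdtemp()
pkg = Path(tmp) / "demo_data"
pkg.mkdir()
(pkg / "__init__.py").write_text("")
(pkg / "iris_like.csv").write_text("sepal,petal\n5.1,1.4\n", encoding="utf-8")
sys.path.insert(0, tmp)
importlib.invalidate_caches()

rows = load_csv_data("demo_data", "iris_like.csv")
```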
I merged #27910. Feel free to merge
…nto remove-python3.9-backport
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
LGTM
…7945) Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org> Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Clean-up some Python 3.9 backports once #27910 is merged.