8000 Implement some methods almost compatible with Scikit-learn private methods. by himkt · Pull Request #952 · optuna/optuna · GitHub
[go: up one dir, main page]

Skip to content

Implement some methods almost compatible with Scikit-learn private methods. #952

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Mar 10, 2020
Merged

Implement some methods almost compatible with Scikit-learn private methods. #952

merged 19 commits into from
Mar 10, 2020

Conversation

himkt
Copy link
Member
@himkt himkt commented Feb 22, 2020

This PR is follow-up for #881.

I implement some methods to reduce dependencies on scikit-learn private methods.
I basically define methods to be compatible with scikit-learn's, but some points are di 8000 fferent.

(The name of this branch should be sklearn-privates...:innocent:)

)


def _num_samples(x):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Original implementation is here.
How much should I take care of the original implementation? (is this too simple?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It basically seems good to me to keep the code simple.
This implementation removes the handling of exceptional cases, so we may have bug reports about them.
So, please add a link to the original implementation (including the commit id) as a comment.

Also, I'm not familiar with Dask dataframe, but I think we need the following check for Dask users.
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/validation.py#L155-L158

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the suggestion. 🙏
I added the specific check for a dask dataframe in f8f21a5.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the suggestion. 🙏
I added the specific check for a dask dataframe in f8f21a5.

fit_params_validated: Dict = {}
for key, value in fit_params.items():
if (
not _is_arraylike(value) or
Copy link
Member Author
@himkt himkt Feb 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Original implementation is here.

Currently, scikit-learn does not accept non-iterable inputs and this line is for keeping backward compatibility.
scikit-learn/scikit-learn#15805

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please leave the link to the original implementation with the commit id in a comment for the future development.

@himkt himkt changed the title Implement some methods almost compatible with NumPy private methods Implement some methods almost compatible with Scikit-learn private methods Feb 23, 2020
Copy link
Member
@toshihikoyanase toshihikoyanase left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your PR. I confirmed that examples/optuna_search_cv_simple.py successfully worked with this implementation using v0.20.4, 0.21.3 and 0.22.1 of scikit-learn.

@@ -18,7 +17,6 @@
from sklearn.utils import check_random_state
from sklearn.utils.metaestimators import _safe_split
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_safe_split is also a private method of sklearn. I think we can work on it in the new PR because it is not related to #881.

)


def _num_samples(x):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It basically seems good to me to keep the code simple.
This implementation removes the handling of exceptional cases, so we may have bug reports about them.
So, please add a link to the original implementation (including the commit id) as a comment.

Also, I'm not familiar with Dask dataframe, but I think we need the following check for Dask users.
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/validation.py#L155-L158

fit_params_validated: Dict = {}
for key, value in fit_params.items():
if (
not _is_arraylike(value) or
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please leave the link to the original implementation with the commit id in a comment for the future development.

):
fit_params_validated[key] = value
else:
fit_params_validated[key] = value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original code here applies the _make_indexiable to value. Do we have any drawbacks?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, you're right. I missed it.
I added _make_indexable in 2be88c7.

himkt and others added 6 commits February 27, 2020 22:55
Co-Authored-By: Toshihiko Yanase <toshihiko.yanase@gmail.com>
Co-Authored-By: Toshihiko Yanase <toshihiko.yanase@gmail.com>
Co-Authored-By: Toshihiko Yanase <toshihiko.yanase@gmail.com>
# NOTE For dask dataframes
# https://github.com/scikit-learn/scikit-learn/blob/ \
# 8caa93889f85254fc3ca84caa0a24a1640eebdd1/sklearn/utils/validation.py#L155-L158
if hasattr(x, 'shape') and x.shape is not None:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mypy throws the following errors. (https://app.circleci.com/jobs/github/optuna/optuna/29090)

#!/bin/bash -eo pipefail
. venv/bin/activate
mypy --disallow-untyped-defs --ignore-missing-imports .
optuna/integration/sklearn.py:134: error: Item "List[Any]" of "Union[List[Any], Any, Any]" has no attribute "shape"
optuna/integration/sklearn.py:135: error: Item "List[Any]" of "Union[List[Any], Any, Any]" has no attribute "shape"
optuna/integration/sklearn.py:136: error: Item "List[Any]" of "Union[List[Any], Any, Any]" has no attribute "shape"
optuna/integration/sklearn.py:136: error: Incompatible return value type (got "Integral", expected "int")
Found 4 errors in 1 file (checked 121 source files)

Exited with code exit status 1

ArrayLikeType is defined here and I don't understand why these errors occur. 🤕
(ArrayLikeType = Union[List, np.ndarray, pd.Series])

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think mypy cannot infer the type based on hasattr.
How about using getattr? It is suggested here.

Example:

    x_shape = getattr(x, 'shape', None)
    if x_shape is not None:
        if isinstance(x_shape[0], Integral):
            return int(x_shape[0])

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works. Thank you!

Copy link
Member
@toshihikoyanase toshihikoyanase left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I investigated the mypy error and I think I can find a workaround.

# NOTE For dask dataframes
# https://github.com/scikit-learn/scikit-learn/blob/ \
# 8caa93889f85254fc3ca84caa0a24a1640eebdd1/sklearn/utils/validation.py#L155-L158
if hasattr(x, 'shape') and x.shape is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think mypy cannot infer the type based on hasattr.
How about using getattr? It is suggested here.

Example:

    x_shape = getattr(x, 'shape', None)
    if x_shape is not None:
        if isinstance(x_shape[0], Integral):
            return int(x_shape[0])

# NOTE Original implementation:
# https://github.com/scikit-learn/scikit-learn/blob/ \
# 8caa93889f85254fc3ca84caa0a24a1640eebdd1/sklearn/utils/validation.py#L217-L234
# It removed the check if an input is scipy sparse matrix
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, but I couldn't get the point of this comment.
When I checked the difference between the original code and this code, I understood that the latter one does not have the conversion from scipy sparse matrix to csr. Could you mention it and add the reason?

Copy link
Member Author
@himkt himkt Mar 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for my ambiguous comment...

the latter one does not have the conversion from scipy sparse matrix to csr.

Yes, you're right.
Actually, I ignored a sparse matrix because I didn't find any use case. 🤕
(In 9c6da5d, I added the support for a sparse matrix and removed the comment in
ee7a022)

@codecov-io
Copy link
codecov-io commented Mar 2, 2020

Codecov Report

Merging #952 into master will increase coverage by 0.01%.
The diff coverage is 82.35%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #952      +/-   ##
==========================================
+ Coverage   90.15%   90.17%   +0.01%     
==========================================
  Files         112      114       +2     
  Lines        9306     9548     +242     
==========================================
+ Hits         8390     8610     +220     
- Misses        916      938      +22
Impacted Files Coverage Δ
tests/integration_tests/test_sklearn.py 100% <100%> (ø) ⬆️
optuna/integration/sklearn.py 75.23% <72.72%> (-0.37%) ⬇️
optuna/trial.py 87.05% <0%> (-0.72%) ⬇️
...ration_tests/lightgbm_tuner_tests/test_optimize.py 98.09% <0%> (-0.33%) ⬇️
setup.py 0% <0%> (ø) ⬆️
optuna/exceptions.py 100% <0%> (ø) ⬆️
optuna/visualization/parallel_coordinate.py 92.3% <0%> (ø) ⬆️
optuna/logging.py 93.65% <0%> (ø) ⬆️
optuna/structs.py 94.11% <0%> (ø) ⬆️
optuna/samplers/grid.py 86.76% <0%> (ø)
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a9ddee3...ac86f6a. Read the comment docs.

@himkt
Copy link
Member Author
himkt commented Mar 3, 2020

@toshihikoyanase
I revised my PR to include the following changes:

  • fixing a wrong type hint d901975
  • adding tests for _make_indexable 87ad173
  • updating the comment to be a sentence 8956f4c
  • fixing the wrong URL in the comment ac86f6a

Could you please take a look? 🙇

For old fashion type hints, I'd like to create a follow-up PR
since I think it would be better to update all type hints in sklearn.py and test_sklearn.py at the same time. (but it should be done in another PR)
What do you think?

Copy link
Member
@toshihikoyanase toshihikoyanase left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@himkt

LGTM. Thank you for your update.

For old fashion type hints, I'd like to create a follow-up PR
since I think it would be better to update all type hints in sklearn.py and test_sklearn.py at the same time. (but it should be done in another PR)

That makes sense. Let's update the type hints in a new PR.

@toshihikoyanase
Copy link
Member

@Y-oHr-N This PR will introduce some private methods of scikit-learn to optuna/integration/sklearn.py. Instead of just copying the implementation from scikit-learn, it simplifies the methods in terms of maintenance costs and testability. But I'm not 100% sure about the simplification. Please let us know if you have any comments on this implementation.

@Y-oHr-N
Copy link
Contributor
Y-oHr-N commented Mar 9, 2020

sklearn.utils.safe_indexing is a private function since version 0.22, so you may need to implement it.

-     from sklearn.utils import safe_indexing as sklearn_safe_indexing
+     if sklearn_version >= "0.22":
+         from sklearn.utils import _safe_indexing as sklearn_safe_indexing
+     else:
+         from sklearn.utils import safe_indexing as sklearn_safe_indexing

@toshihikoyanase
Copy link
Member

@Y-oHr-N Thank you for pointing it out! It is deprecated and will be removed in 0.24. IMO, we can work on it in a new PR because we still have some time.

https://github.com/scikit-learn/scikit-learn/blob/0.22.X/sklearn/utils/__init__.py#L292-L294

@deprecated("safe_indexing is deprecated in version "
            "0.22 and will be removed in version 0.24.")
def safe_indexing(X, indices, axis=0):

@Y-oHr-N
Copy link
Contributor
Y-oHr-N commented Mar 10, 2020

@toshihikoyanase, You're right.

There are no other comments. LGTM.

@toshihikoyanase
Copy link
Member

@Y-oHr-N I create an issue about safe_indexing(#1004).
Thank you for your review.

@toshihikoyanase toshihikoyanase self-assigned this Mar 10, 2020
@hvy hvy self-assigned this Mar 10, 2020
Copy link
Member
@hvy hvy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM!

@hvy hvy merged commit 961e7ae into optuna:master Mar 10, 2020
@hvy hvy added the code-fix Change that does not change the behavior, such as code refactoring. label Mar 10, 2020
@hvy hvy added this to the v1.3.0 milestone Mar 10, 2020
@hvy
Copy link
Member
hvy commented Mar 10, 2020

Not entirely sure about the PR labeling but I added one for now. Let me just modify the title to match our release note format.

@hvy hvy changed the title Implement some methods almost compatible with Scikit-learn private methods Implement some methods almost compatible with Scikit-learn private methods. Mar 10, 2020
@himkt himkt deleted the numpy-privates branch March 10, 2020 10:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
code-fix Change that does not change the behavior, such as code refactoring.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0