[MRG] Creation of a v0 ARPACK initialization function #11524
Conversation
For your TODO list, you can make a check list like this.
For the unit test of _init_arpack_v0, you can generate several v0 with different random_state values and check that they are not all equal. You can also check that they are sampled as expected, using pytest.approx on the mean and std, for example.
@rth do you know what the failure on LGTM is?
lgtm is broken right now :-/
LGTM
Any idea what the codecov failure means?
Tests would need to be added for the lines that are currently not covered (or the corresponding `if` branches removed, if possible).
Just noticed that there is an arpack submodule sklearn/utils/arpack.py.
Although all functions inside are in a deprecation cycle, I think _init_arpack_v0 should live there.
And if you make that scipy PR, this function would eventually be deprecated one day.
Are you sure about this? Because the first line of arpack.py says that this file is to be removed in version 0.21; that's why I did not put it there. Should I make the change, or do we merge it as is?
Yes, feel free to remove that comment and put it there. Only the deprecated functions will then be removed.
Any idea on the error? This seems completely unrelated; should I resubmit?
The CI fails because you need to fix the following docstring:

```
______________ [doctest] sklearn.cross_decomposition.pls_.PLSSVD _______________
[..]
798 >>> plsca = PLSSVD(n_components=2)
799 >>> plsca.fit(X, Y)
Expected:
    PLSSVD(copy=True, n_components=2, scale=True)
Got:
    PLSSVD(copy=True, n_components=2, random_state=None, scale=True)
```

(adding the random_state to the expected output)
Otherwise LGTM, thanks!
Please add `.. versionadded:: 0.21` below the random_state in docstrings where this parameter was added.
Also please add a what's new entry.
```
@@ -116,6 +116,14 @@ Support for Python 3.4 and below has been officially dropped.
     and :class:`tree.ExtraTreeRegressor`.
     :issue:`12300` by :user:`Adrin Jalali <adrinjalali>`.

 :mod:`sklearn.utils`
 ..................
 - New :func:`utils._init_arpack_v0` which goal is to be used each time eigs,
```
This creates a private method. We only need to document user-facing changes in what's new.
All right then I'll remove that
```
@@ -160,7 +161,8 @@ def fit_transform(self, X, y=None):
        random_state = check_random_state(self.random_state)

        if self.algorithm == "arpack":
            U, Sigma, VT = svds(X, k=self.n_components, tol=self.tol)
            v0 = _init_arpack_v0(min(X.shape), self.random_state)
```
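For reference, a quick sketch of why passing v0 matters here: with the same starting vector, repeated svds calls return identical results (the uniform [-1, 1] sampling below stands in for what the PR's helper is described to do):

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.RandomState(0)
X = rng.rand(30, 20)

# Stand-in for the PR's helper: uniform [-1, 1] starting vector
v0 = np.random.RandomState(42).uniform(-1, 1, min(X.shape))

U1, s1, Vt1 = svds(X, k=5, v0=v0)
U2, s2, Vt2 = svds(X, k=5, v0=v0)
# Same v0 -> identical singular values across runs
assert np.allclose(s1, s2)
```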
This is going to change the results of the function.
- Is it beneficial and worth breaking backwards compatibility?
- If so, it needs to be documented in the change log.
- Without this initialization, v0 is initialized with randn. In ARPACK it is initialized with a uniform distribution on [-1, 1]. The goal is also to generalize the initialization of the calls to the ARPACK library and to provide a random state.
- What should be documented? For each impacted algorithm, something like "Initialization of v0 is now uniform on [-1, 1]"?
```
@@ -844,7 +856,8 @@ def fit(self, X, Y):
        if self.n_components >= np.min(C.shape):
            U, s, V = svd(C, full_matrices=False)
        else:
            U, s, V = svds(C, k=self.n_components)
            v0 = _init_arpack_v0(min(C.shape), self.random_state)
```
This may change the result (as per my comment in truncated svd below)
```
@@ -761,6 +762,15 @@ class PLSSVD(BaseEstimator, TransformerMixin):
    copy : boolean, default True
        Whether to copy X and Y, or perform in-place computations.

    random_state : int, RandomState instance or None, optional (default=None)
        The seed of the pseudo random number generator to use when shuffling
```
The random state is not used when shuffling the data here. Please fix the docstring.
@FollowKenny are you working on those comments?
I think I have everything ready to push, but I'm still not sure about the last two points you raised in your last comment.
Hi @imnotaqtpie, sorry, I know it took a while to come back to you. Are you still interested in taking over? Do you need some guidance? Thanks and sorry again.
take |
Reference Issues/PRs
Fixes #5545
What does this implement/fix? Explain your changes.
This PR creates a function sklearn.utils.init_arpack_v0(size, random_state) whose goal is to be used each time eigs, eigsh, or svds from scipy.sparse.linalg is called. It initializes the v0 parameter with values sampled from the uniform distribution on [-1, 1] (as ARPACK does) to avoid convergence issues with other initializations. The v0 parameter is mandatory, as it is the only way to make the behaviour of these linalg functions deterministic.
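A sketch of what such a helper could look like, based on the description above (the name is the PR's; the body is an assumption, and sklearn's check_random_state would normally replace the raw RandomState construction so that ints, RandomState instances, and None are all accepted):

```python
import numpy as np

def _init_arpack_v0(size, random_state):
    """Sample an ARPACK starting vector v0 of the given size.

    Values are drawn from the uniform distribution on [-1, 1],
    matching ARPACK's own default initialization.
    """
    # sklearn would call check_random_state(random_state) here to
    # also accept RandomState instances; an int or None seed works
    # directly with np.random.RandomState
    rng = np.random.RandomState(random_state)
    return rng.uniform(-1, 1, size)

v0 = _init_arpack_v0(10, random_state=0)
```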
Any other comments?
I put the function in __init__.py as I have seen that some general utils functions are there, but I'm not convinced by this choice. Maybe a utils.utils could be created to contain one-shot functions which don't belong to a group?
For now I just replaced places where randomization was correctly set using v0 parameter.
TODO:
@amueller @rth: Should I change some tests or define new ones for the functions and classes that have been changed (most notably those calling svds without random_state defined)? It seems that random_state parameters are not often checked by tests, so maybe leave it as is?
I opened an issue for scipy to add a seed parameter; I'm probably going to take care of it after this PR. It might have some impact on this work, but I don't think so.
There are still some things bothering me, but I can't find a better solution: