[MRG] Creation of a v0 ARPACK initialization function #11524
Conversation
For your TODO list, you can make a check list like this.
For the unit test of _init_arpack_v0, you can generate several v0 with different random_state values and check that they are not all equal. You can also check that they are sampled as expected, using pytest.approx on the mean and std, for example.
@rth do you know what the failure on LGTM is?
lgtm is broken right now :-/
LGTM
Any idea what the codecov failure means?
Tests would need to be added for the lines that are currently not covered (or the corresponding `if` branches removed, if possible).
Just noticed that there is an arpack submodule sklearn/utils/arpack.py.
Although all functions inside are in a deprecation cycle, I think _init_arpack_v0 should live there.
And if you make that scipy PR, this function would eventually be deprecated one day.
Are you sure about this? Because the first line of arpack.py says that this file is to be removed in version 0.21; that's why I did not put it there. Should I make the change, or do we merge it as is?
Yes, feel free to remove that comment and put it there. Only the deprecated functions will then be removed.
Any idea on the error? This seems completely unrelated; should I resubmit?
The CI fails because you need to fix the following docstring:

```
______________ [doctest] sklearn.cross_decomposition.pls_.PLSSVD _______________
[..]
798 >>> plsca = PLSSVD(n_components=2)
799 >>> plsca.fit(X, Y)
Expected:
    PLSSVD(copy=True, n_components=2, scale=True)
Got:
    PLSSVD(copy=True, n_components=2, random_state=None, scale=True)
```

(adding the random_state to the expected output)
Otherwise LGTM, thanks!
Please add `.. versionadded:: 0.21` below the random_state in docstrings where this parameter was added.
Also please add a what's new entry.
```
@@ -116,6 +116,14 @@ Support for Python 3.4 and below has been officially dropped.
     and :class:`tree.ExtraTreeRegressor`.
     :issue:`12300` by :user:`Adrin Jalali <adrinjalali>`.

 :mod:`sklearn.utils`
 ..................
 - New :func:`utils._init_arpack_v0` which goal is to be used each time eigs,
```
This creates a private method. We only need to document user-facing changes in what's new.
All right then I'll remove that
```
@@ -160,7 +161,8 @@ def fit_transform(self, X, y=None):
        random_state = check_random_state(self.random_state)

        if self.algorithm == "arpack":
            U, Sigma, VT = svds(X, k=self.n_components, tol=self.tol)
            v0 = _init_arpack_v0(min(X.shape), self.random_state)
```
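For reference, a quick sketch of why passing v0 matters here: with the same starting vector, repeated svds calls return identical results (the uniform [-1, 1] sampling below stands in for what the PR's helper is described to do):

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.RandomState(0)
X = rng.rand(30, 20)

# Stand-in for the PR's helper: uniform [-1, 1] starting vector
v0 = np.random.RandomState(42).uniform(-1, 1, min(X.shape))

U1, s1, Vt1 = svds(X, k=5, v0=v0)
U2, s2, Vt2 = svds(X, k=5, v0=v0)
# Same v0 -> identical singular values across runs
assert np.allclose(s1, s2)
```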
This is going to change the results of the function.
- Is it beneficial and worth breaking backwards compatibility?
- If so, it needs to be documented in the change log.
- Without this initialization, v0 is initialized with randn. In ARPACK it is initialized with a uniform distribution on [-1, 1]. The goal is also to generalize the initialization of the calls to the ARPACK library and to provide a random state.
- What should be documented? For each impacted algorithm, something like "Initialization of v0 is now uniform on [-1, 1]"?
```
@@ -844,7 +856,8 @@ def fit(self, X, Y):
        if self.n_components >= np.min(C.shape):
            U, s, V = svd(C, full_matrices=False)
        else:
            U, s, V = svds(C, k=self.n_components)
            v0 = _init_arpack_v0(min(C.shape), self.random_state)
```
This may change the result (as per my comment in truncated svd below)
```
@@ -761,6 +762,15 @@ class PLSSVD(BaseEstimator, TransformerMixin):
    copy : boolean, default True
        Whether to copy X and Y, or perform in-place computations.

    random_state : int, RandomState instance or None, optional (default=None)
        The seed of the pseudo random number generator to use when shuffling
```
The random state is not used when shuffling the data here. Please fix the docstring.
@FollowKenny are you working on those comments?
I think I have everything ready to push, but I'm still not sure about the last two points you raised in your last comment.
Hi @imnotaqtpie, sorry, I know it took a while to come back to you. Are you still interested in taking over? Do you need some guidance? Thanks and sorry again.
take |
Reference Issues/PRs
Fixes #5545
What does this implement/fix? Explain your changes.
This PR creates a function sklearn.utils.init_arpack_v0(size, random_state) whose goal is to be used each time eigs, eigsh, or svds from scipy.sparse.linalg is called. It initializes the v0 parameter with values sampled from the uniform distribution on [-1, 1] (as ARPACK does) to avoid convergence issues with other initializations. The v0 parameter is mandatory, as it is the only way to make the behaviour of these linalg functions deterministic.
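A sketch of what such a helper could look like, based on the description above (the name is the PR's; the body is an assumption, and sklearn's check_random_state would normally replace the raw RandomState construction so that ints, RandomState instances, and None are all accepted):

```python
import numpy as np

def _init_arpack_v0(size, random_state):
    """Sample an ARPACK starting vector v0 of the given size.

    Values are drawn from the uniform distribution on [-1, 1],
    matching ARPACK's own default initialization.
    """
    # sklearn would call check_random_state(random_state) here to
    # also accept RandomState instances; an int or None seed works
    # directly with np.random.RandomState
    rng = np.random.RandomState(random_state)
    return rng.uniform(-1, 1, size)

v0 = _init_arpack_v0(10, random_state=0)
```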
Any other comments?
I put the function in __init__.py as I have seen that some general utils functions are there, but I'm not convinced by this choice. Maybe a utils.utils could be created to contain one-shot functions which don't belong to a group?
For now I just replaced places where randomization was correctly set using v0 parameter.
TODO:
@amueller @rth: Should I change some tests or define new ones for the functions and classes that have been changed (most notably those calling svds without random_state defined)? It seems that random_state parameters are not often checked by tests, so maybe leave it as is?
I opened an issue for scipy to add a seed parameter; I'm probably going to take care of it after this PR. It might have some impact on this work, but I don't think so.
There are still some things bothering me, but I can't find a better solution: