Linear models take unreasonably longer time at certain data sizes #10813

Closed
LeeLeeYeah opened this issue Mar 15, 2018 · 10 comments · Fixed by #11754
Labels
Easy (Well-defined and straightforward way to resolve) · good first issue (Easy with clear instructions to resolve) · help wanted

Comments

LeeLeeYeah commented Mar 15, 2018

Description

When I use Lasso to fit an artificial data set, the running time shows a weird pattern: when the data size is 15900 * 500 or 16100 * 500, fitting takes less than 2 seconds. However, when the size is 16000 * 500, it takes more than 20 seconds. This makes no sense.

The experiment is repeatable. I am using sklearn 0.19.0; I tried the same program on Windows, macOS and Linux, and all of them have this problem.

The problem appears only when the input data contains very large numbers. In this example I use 1e50.

Other linear models like Ridge also have this problem.

Steps/Code to Reproduce

from sklearn import linear_model
import numpy as np
import time

estimator = linear_model.Lasso()
#dimension is fixed
dimension = 500
#sampleNumber ranges from 15400 to 16400 in steps of 100
for sampleNumber in range(15400, 16500, 100):
    x = np.ones([sampleNumber, dimension]) * 1e50
    y = np.ones([sampleNumber]) * 1e50
    #measure running time
    start = time.time()
    estimator = estimator.fit(x, y)
    end = time.time()
    print("Sample Number:%d, Time: %f" % (sampleNumber, end-start))

Actual Output

Sample Number:15400, Time: 1.460229
Sample Number:15500, Time: 26.534857
Sample Number:15600, Time: 1.256980
Sample Number:15700, Time: 1.437125
Sample Number:15800, Time: 1.290524
Sample Number:15900, Time: 1.298007
Sample Number:16000, Time: 26.865131
Sample Number:16100, Time: 1.278997
Sample Number:16200, Time: 1.320840
Sample Number:16300, Time: 1.385144
Sample Number:16400, Time: 1.512907

Versions

Windows-10-10.0.16299-SP0
Python 3.6.2 |Anaconda custom (64-bit)| (default, Sep 19 2017, 08:03:39) [MSC v.1900 64 bit (AMD64)]
NumPy 1.13.1
SciPy 0.19.1
Scikit-Learn 0.19.0

lesteve (Member) commented Mar 15, 2018

Interesting, I can reproduce the behaviour and I don't know why this happens. If I were you I would try to profile to see whether you can learn something from it.

Note that if I remove the 1e50, the timings are similar for all sample numbers.
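
In case someone wants to follow this up, here is a minimal profiling sketch (not part of the original report; it just reuses the data from the script above) using the standard library's cProfile:

import cProfile
import numpy as np
from sklearn import linear_model

# Profile one of the slow cases (16000 x 500) and sort by cumulative time.
dimension = 500
sampleNumber = 16000
x = np.ones([sampleNumber, dimension]) * 1e50
y = np.ones([sampleNumber]) * 1e50
estimator = linear_model.Lasso()

cProfile.runctx("estimator.fit(x, y)", globals(), locals(), sort="cumtime")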

jnothman (Member) commented Mar 15, 2018 via email

lesteve (Member) commented Mar 15, 2018

On my machine estimator.n_iter_ is 2 in the fast cases and 1000 (i.e. n_iter_ == max_iter) in the slow cases.
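
For anyone reproducing this, a quick way to check the same thing (a sketch reusing the data from the report, not part of the original comment) is to compare n_iter_ with max_iter after the fit:

from sklearn import linear_model
import numpy as np

# Flag a fit that hit the iteration limit, which usually means it did not converge.
x = np.ones([16000, 500]) * 1e50
y = np.ones([16000]) * 1e50
estimator = linear_model.Lasso().fit(x, y)
if estimator.n_iter_ >= estimator.max_iter:
    print("likely did not converge: n_iter_ =", estimator.n_iter_)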

rth (Member) commented Mar 16, 2018

So essentially, in some cases this example does not seem to converge.

Lasso optimizes the Elastic Net objective function, which is convex, and the coordinate descent solver used should theoretically always converge. Since it doesn't, the only thing I can think of is that it's due to some numerical issue. Using numbers on the order of 1e50 definitely does not help.

As suggested by @lesteve, just scaling the input data should address this issue. This is also mentioned in the user guide: agreed, linear models are generally more robust to feature scaling, but maybe not up to the 1e50 order of magnitude...
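
To illustrate the workaround (a sketch, not a verified general fix): on real data one would typically use something like sklearn.preprocessing.StandardScaler, but for the constant toy data above it is enough to divide out the common 1e50 factor before fitting:

from sklearn import linear_model
import numpy as np

# Rescale the inputs so values are of order 1 instead of 1e50 before fitting.
sampleNumber, dimension = 16000, 500
x = np.ones([sampleNumber, dimension]) * 1e50
y = np.ones([sampleNumber]) * 1e50

scale = np.abs(x).max()
estimator = linear_model.Lasso()
estimator.fit(x / scale, y / scale)
print(estimator.n_iter_)  # should now converge within a handful of iterations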

jnothman (Member) commented Mar 17, 2018 via email

lesteve (Member) commented Mar 26, 2018

Maybe a ConvergenceWarning could have helped as mentioned by @rth in #10866?

rth (Member) commented Apr 24, 2018

Yes, I think adding a convergence warning would have helped; as mentioned in #10813 (comment), there is no other immediately straightforward way to detect this issue.

I think this solver:

def sparse_enet_coordinate_descent(floating [:] w,

and the other enet_coordinate_descent* solvers there should raise a convergence warning if the main loop exits without reaching the desired tolerance.

rth added the Easy, good first issue and help wanted labels and removed the Sprint label on Apr 24, 2018
kajocina commented May 4, 2018

Since this issue has been tagged as a good first issue, I would be happy to help with it!

kajocina commented May 7, 2018

I'm quite new to this (I've been an sklearn end-user only so far) so I would really appreciate some guidance for my first contribution. I took a look at the code and it looks like a potential place to raise the warning could be in coordinate_descent.py, after line 481:

coef_, dual_gap_, eps_, n_iter_ = model

There, the model object contains all three variables needed to check for lack of convergence (namely dual_gap_, eps_ and n_iter_).
Do you think this is a valid place, or should the ConvergenceWarning be raised within the .pyx files directly?
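
To make the idea concrete, here is a rough sketch of such a check (the helper name and exact condition are hypothetical, not the actual patch that was merged):

import warnings
from sklearn.exceptions import ConvergenceWarning

# Hypothetical helper that could be called right after
# "coef_, dual_gap_, eps_, n_iter_ = model" in coordinate_descent.py.
def warn_if_not_converged(dual_gap_, eps_, n_iter_, max_iter):
    if dual_gap_ > eps_ or n_iter_ >= max_iter:
        warnings.warn(
            "Objective did not converge (duality gap: %g, tolerance: %g, "
            "n_iter: %d). You might want to increase the number of "
            "iterations or scale your data." % (dual_gap_, eps_, n_iter_),
            ConvergenceWarning)

# Example call with values resembling the slow case from this issue (hit max_iter):
warn_if_not_converged(dual_gap_=0.0, eps_=1e-4, n_iter_=1000, max_iter=1000)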

mathurinm (Contributor) commented
2 additional findings:

  • we are fitting an (unregularized) intercept and y is constant, so the optimal solution should be np.zeros(n_features), whatever the value of alpha. This clearly makes the example a degenerate case. If I use fit_intercept=False, I get similar timings for all n_samples.

  • if I use fit_intercept=True, the solutions in the cases which run fine all have 1 as their first element, then nothing but zeros, and estimator.dual_gap_ is equal to n_samples. For the 2 problematic cases, coef_ is 0 everywhere and dual_gap_ is exactly 0, which is surprising.

Also, as mentioned, we are fitting with alpha=1. for a problem where the smallest alpha giving a 0 solution is 1.5e104...
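
For context, a small sketch of how the critical alpha can be estimated (this uses scikit-learn's 1/(2*n_samples) loss scaling with no intercept; the exact value depends on the normalization convention and on whether y is centered, so it need not match the 1.5e104 figure quoted above):

import numpy as np

# Smallest alpha giving an all-zero Lasso solution (no intercept), under the
# objective (1 / (2 * n_samples)) * ||y - Xw||^2 + alpha * ||w||_1:
# alpha_max = max|X.T @ y| / n_samples.
n_samples, dimension = 16000, 500
X = np.ones([n_samples, dimension]) * 1e50
y = np.ones([n_samples]) * 1e50

alpha_max = np.max(np.abs(X.T @ y)) / n_samples
print(alpha_max)  # ~1e100, i.e. vastly larger than the default alpha=1.0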
