Linear models take an unreasonably long time for certain data sizes. #10813
Comments
Interesting, I can reproduce the behaviour and I don't know why this happens. If I were you I would try to profile to see whether you can learn something from it. Note that if I remove the 1e50 scaling factor, the problem goes away.
I assume the number of iterations is very different, not just the wall time?
On my machine:
So essentially, in some cases this example does not seem to converge. Lasso optimizes the Elastic Net objective function, which is convex, and the coordinate descent solver used should theoretically always converge. Since it doesn't, the only thing I can think of is that it's due to some numerical issues. Using numbers of the order of 1e50 definitely does not help. As suggested by @lesteve, just scaling the input data should address this issue. This is also mentioned in the user guide: agreed, linear models are generally more robust to feature scaling, but maybe not up to 1e50 orders of magnitude...
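(For illustration, a minimal sketch of the suggested fix, assuming data like that in the report further down; the scaler choice, sizes and random data are assumptions, not code from this thread.)

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(16000, 500) * 1e50   # hypothetical data with very large values
y = rng.rand(16000)

# Standardizing brings the features back to order 1, so the coordinate
# descent solver no longer has to work with values around 1e50.
X_scaled = StandardScaler().fit_transform(X)
model = Lasso().fit(X_scaled, y)
print(model.n_iter_)
```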
Being robust to feature scaling is a theoretical property. All our implementations necessarily use fixed-precision numbers. I think all we can do is try to improve the identification of bad datasets and warn. But I'm not sure how to do that when an overflow flag is not explicitly set.
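(As an illustration of what such an identification step could look like, here is a hedged sketch; the helper name and the threshold are invented for the example and are not proposed anywhere in this thread.)

```python
import warnings
import numpy as np

def warn_on_extreme_feature_scale(X, max_abs=1e10):
    """Hypothetical pre-fit check: warn when feature magnitudes are so large
    that a fixed-precision solver is likely to run into numerical trouble."""
    scale = np.max(np.abs(X))
    if scale > max_abs:
        warnings.warn(
            "Features contain values up to %.1e; consider rescaling the data "
            "(e.g. with StandardScaler) before fitting." % scale
        )
```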
Yes, I think adding a convergence warning would have helped; as mentioned in #10813 (comment), there is no other immediately straightforward way to detect this issue otherwise. I think sklearn/linear_model/cd_fast.pyx (line 320 at 3e26fc6) and the other coordinate descent loops would be the places to add it.
Since this issue has been tagged as a good first issue, I would be happy to help with this problem!
I'm quite new to this (I've been an sklearn end-user only so far), so I would really appreciate some guidance for my first contribution. I took a look at the code, and a potential place to raise the warning could be in coordinate_descent.py, after line 481: there the model object contains all three attributes needed to check for lack of convergence (namely dual_gap_, eps_ and n_iter_).
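(A rough, hedged sketch of what such a check might look like, using the attributes named above; the helper name and the exact condition are assumptions for illustration, not the actual patch.)

```python
import warnings
from sklearn.exceptions import ConvergenceWarning

def warn_if_not_converged(model):
    """Hypothetical helper, called after fitting, once dual_gap_, eps_ and
    n_iter_ have been set on the estimator (as described above)."""
    if model.n_iter_ >= model.max_iter and model.dual_gap_ > model.eps_:
        warnings.warn(
            "Objective did not converge. You might want to increase the "
            "number of iterations or scale your input data.",
            ConvergenceWarning,
        )
```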
Two additional findings:
Also, as mentioned, we are fitting with alpha=1.0 for a problem where the features are of the order of 1e50, so the penalty is negligible relative to the data term.
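(For context, and not part of the original comment: scikit-learn's Lasso minimizes the objective below, which makes the scale argument concrete.)

```math
\min_{w}\; \frac{1}{2\,n_{\text{samples}}}\,\lVert y - Xw \rVert_2^2 \;+\; \alpha\,\lVert w \rVert_1
```

Replacing X with cX and w with w/c leaves the least-squares term unchanged while dividing the penalty by c, so fitting c·X with alpha = 1 is equivalent to fitting X with alpha = 1/c; with c = 1e50 the regularization is effectively zero.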
Description
When I use Lasso to fit an artificial data set, the running time has a weird pattern: when the data size is 15900 * 500 or 161000 * 500, it takes less than 2 seconds, but when the size is 16000 * 500 it takes more than 20 seconds. It makes no sense at all.
The experiment is repeatable. I am using sklearn 0.19.0; I tried the same program on Windows, macOS and Linux, and all of them have this problem.
This problem appears only when the input data contains very large numbers; in this example I use values of the order of 1e50.
Other linear models such as Ridge also have this problem.
Steps/Code to Reproduce
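(The original snippet is not shown here; below is a minimal reconstruction based on the description above. The matrix shapes, the 1e50 factor and the default Lasso come from the report; the random data and timing code are assumptions.)

```python
import time
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)

for n_samples in (15900, 16000, 161000):
    X = rng.rand(n_samples, 500) * 1e50  # very large feature values
    y = rng.rand(n_samples)
    model = Lasso()                      # default alpha=1.0
    start = time.time()
    model.fit(X, y)
    print("%6d x 500: %.1f s, n_iter_ = %s"
          % (n_samples, time.time() - start, model.n_iter_))
```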
Actual Output
Versions
Windows-10-10.0.16299-SP0
Python 3.6.2 |Anaconda custom (64-bit)| (default, Sep 19 2017, 08:03:39) [MSC v.1900 64 bit (AMD64)]
NumPy 1.13.1
SciPy 0.19.1
Scikit-Learn 0.19.0