MNT speed up example plot_digits_pipe.py #21728
Conversation
examples/compose/plot_digits_pipe.py (outdated diff)
 # Define a pipeline to search for the best combination of PCA truncation
 # and classifier regularization.
 pca = PCA()
 # set the tolerance to a large value to make the example faster
-logistic = LogisticRegression(max_iter=10000, tol=0.1)
+logistic = LogisticRegression(max_iter=1000, tol=0.2)
 pipe = Pipeline(steps=[("pca", pca), ("logistic", logistic)])
Actually, we are making a rookie 101 mistake here: the data are not scaled. We should not modify anything apart from adding a StandardScaler
as a preprocessing stage:
-pipe = Pipeline(steps=[("pca", pca), ("logistic", logistic)])
+from sklearn.preprocessing import StandardScaler
+scaler = StandardScaler()
+pipe = Pipeline(steps=[("scaler", scaler), ("pca", pca), ("logistic", logistic)])
I get a 5x speed-up just by scaling the data, because the LogisticRegression converges faster.
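For reference, here is a minimal runnable sketch of the example's pipeline with the suggested StandardScaler step; the classifier settings mirror the original example and only standard scikit-learn APIs are assumed:

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Scale the pixel features first so LogisticRegression converges faster,
# then apply PCA truncation followed by the classifier.
scaler = StandardScaler()
pca = PCA()
logistic = LogisticRegression(max_iter=10000, tol=0.1)
pipe = Pipeline(steps=[("scaler", scaler), ("pca", pca), ("logistic", logistic)])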
I tried a few different versions; here are the timing results (a sketch of the grid search behind these numbers follows the list):
- Original:
  Best parameter (CV score=0.920):
  {'logistic__C': 0.046415888336127774, 'pca__n_components': 45}
  real 7.33, user 58.78, sys 6.04
- Using StandardScaler:
  Best parameter (CV score=0.924):
  {'logistic__C': 0.046415888336127774, 'pca__n_components': 60}
  real 3.46, user 24.25, sys 3.13
- Using subset + higher tolerance:
  Best parameter (CV score=0.942):
  {'logistic__C': 1.0, 'pca__n_components': 45}
  real 2.74, user 15.70, sys 2.22
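For context, a rough, self-contained sketch of the kind of grid search that produces the "Best parameter (CV score=...)" lines above; the parameter grid values are illustrative, not necessarily the exact ones in plot_digits_pipe.py:

import numpy as np

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_digits, y_digits = load_digits(return_X_y=True)

pipe = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("pca", PCA()),
        ("logistic", LogisticRegression(max_iter=1000, tol=0.1)),
    ]
)

# Search over the PCA dimensionality and the logistic regularization
# strength C (grid values here are illustrative).
param_grid = {
    "pca__n_components": [15, 30, 45, 60],
    "logistic__C": np.logspace(-4, 4, 4),
}
search = GridSearchCV(pipe, param_grid, n_jobs=-1)
search.fit(X_digits, y_digits)

print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)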
Do you think we should proceed with the StandardScaler?
Also, what does MNT in the title mean? (Apologies for the incorrect title; I couldn't find MNT in the contributing docs.)
Do you think we should proceed with the StandardScaler?
Yes, when I tried it, it was working quite well. We might need to update the description if the number of hyperparameters changes.
MNT -> Maintenance
I tend to use DOC on these PRs, but no strong feelings.
LGTM
* Updated plot_digits_pipe
* Updated plot_digits_pipe with StandardScaler preprocessing
Reference Issues/PRs
#21598
What does this implement/fix? Explain your changes.
Timed using:
time -p python plot_digits_pipe.py
Time taken by original code (without the plotting part) -
Output:
After updating the code [reduced max iterations, increased tolerance, working with a subset] (without plotting) -
Output:
Any other comments?