8000 MNT speed up example plot_digits_pipe.py by ArthDh · Pull Request #21728 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

MNT speed up example plot_digits_pipe.py #21728

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Nov 25, 2021

Conversation

ArthDh
Copy link
Contributor
@ArthDh ArthDh commented Nov 21, 2021

Reference Issues/PRs

#21598

What does this implement/fix? Explain your changes.

Timed using: time -p python plot_digits_pipe.py
Time taken by original code (without the plotting part) -

real 6.40
user 54.59
sys 4.18

Output:

Best parameter (CV score=0.920):
{'logistic__C': 0.046415888336127774, 'pca__n_components': 45}

Figure_2
After updating code [Reduced max iterations, increased tolerance, working with a subset] (without plotting) -

real 2.73
user 15.67
sys 2.30

Output:

Best parameter (CV score=0.942):
{'logistic__C': 1.0, 'pca__n_components': 45}

Figure_1

Any other comments?

# Define a pipeline to search for the best combination of PCA truncation
# and classifier regularization.
pca = PCA()
# set the tolerance to a large value to make the example faster
logistic = LogisticRegression(max_iter=10000, tol=0.1)
logistic = LogisticRegression(max_iter=1000, tol=0.2)
pipe = Pipeline(steps=[("pca", pca), ("logistic", logistic)])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, we make a 101 rookie mistake here. The data are not scaled. We should not modify anything apart of adding a StandardScaler as a preprocessing stage:

Suggested change
pipe = Pipeline(steps=[("pca", pca), ("logistic", logistic)])
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
pipe = Pipeline(steps=[("scaler", scaler), ("pca", pca), ("logistic", logistic)])

I have a x5 speed up just by scaling the data because the LogisticRegression is converging faster.

Copy link
Contributor Author
@ArthDh ArthDh Nov 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried a few different versions, here are the timing results:

  • Original:
Best parameter (CV score=0.920):
{'logistic__C': 0.046415888336127774, 'pca__n_components': 45}
real 7.33
user 58.78
sys 6.04
  • Using Standard Scaler:
Best parameter (CV score=0.924):
{'logistic__C': 0.046415888336127774, 'pca__n_components': 60}
real 3.46
user 24.25
sys 3.13

The graph changes:
scaled

  • Using Subset + higher tolerance:
Best parameter (CV score=0.942):
{'logistic__C': 1.0, 'pca__n_components': 45}
real 2.74
user 15.70
sys 2.22

Do you think we should proceed with the Standard Scalar?
Also, what does MNT in the title imply(Apologies for the incorrect title, I couldn't find MNT in the contributing docs)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we should proceed with the Standard Scalar?

Yes, when I tried it was working quite well. We might need to check the description if the number of hyperparameter changes.

MNT -> Maintenance

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to use DOC on these PRs though, but no strong feelings.

@glemaitre glemaitre changed the title [MRG] Speedsup plot_digits_pipe example MNT speed up example plot_digits_pipe.py Nov 22, 2021
@adrinjalali adrinjalali mentioned this pull request Nov 22, 2021
41 tasks
Copy link
Member
@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@adrinjalali adrinjalali merged commit c040ab1 into scikit-learn:main Nov 25, 2021
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Nov 29, 2021
* Updated plot_digits_pipe

* Updated plot_digits_pipe with StandardScaler preprocessing
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
* Updated plot_digits_pipe

* Updated plot_digits_pipe with StandardScaler preprocessing
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Dec 24, 2021
* Updated plot_digits_pipe

* Updated plot_digits_pipe with StandardScaler preprocessing
glemaitre pushed a commit that referenced this pull request Dec 25, 2021
* Updated plot_digits_pipe

* Updated plot_digits_pipe with StandardScaler preprocessing
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0