MNT speed up example plot_digits_pipe.py #21728
Conversation
examples/compose/plot_digits_pipe.py (outdated diff)
 # Define a pipeline to search for the best combination of PCA truncation
 # and classifier regularization.
 pca = PCA()
 # set the tolerance to a large value to make the example faster
-logistic = LogisticRegression(max_iter=10000, tol=0.1)
+logistic = LogisticRegression(max_iter=1000, tol=0.2)
 pipe = Pipeline(steps=[("pca", pca), ("logistic", logistic)])
Actually, we are making a rookie 101 mistake here: the data are not scaled. We should not modify anything apart from adding a StandardScaler
as a preprocessing stage:
-pipe = Pipeline(steps=[("pca", pca), ("logistic", logistic)])
+from sklearn.preprocessing import StandardScaler
+scaler = StandardScaler()
+pipe = Pipeline(steps=[("scaler", scaler), ("pca", pca), ("logistic", logistic)])
I get a 5x speed-up just by scaling the data, because the LogisticRegression converges faster.
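For reference, here is a minimal runnable sketch of the example's pipeline with the suggested StandardScaler step; the classifier settings mirror the original example and only standard scikit-learn APIs are assumed:

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Scale the pixel features first so LogisticRegression converges faster,
# then apply PCA truncation followed by the classifier.
scaler = StandardScaler()
pca = PCA()
logistic = LogisticRegression(max_iter=10000, tol=0.1)
pipe = Pipeline(steps=[("scaler", scaler), ("pca", pca), ("logistic", logistic)])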
I tried a few different versions; here are the timing results (a sketch of the grid search behind these numbers follows the list):
- Original:
  Best parameter (CV score=0.920):
  {'logistic__C': 0.046415888336127774, 'pca__n_components': 45}
  real 7.33, user 58.78, sys 6.04
- Using StandardScaler:
  Best parameter (CV score=0.924):
  {'logistic__C': 0.046415888336127774, 'pca__n_components': 60}
  real 3.46, user 24.25, sys 3.13
- Using subset + higher tolerance:
  Best parameter (CV score=0.942):
  {'logistic__C': 1.0, 'pca__n_components': 45}
  real 2.74, user 15.70, sys 2.22
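For context, a rough, self-contained sketch of the kind of grid search that produces the "Best parameter (CV score=...)" lines above; the parameter grid values are illustrative, not necessarily the exact ones in plot_digits_pipe.py:

import numpy as np

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_digits, y_digits = load_digits(return_X_y=True)

pipe = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("pca", PCA()),
        ("logistic", LogisticRegression(max_iter=1000, tol=0.1)),
    ]
)

# Search over the PCA dimensionality and the logistic regularization
# strength C (grid values here are illustrative).
param_grid = {
    "pca__n_components": [15, 30, 45, 60],
    "logistic__C": np.logspace(-4, 4, 4),
}
search = GridSearchCV(pipe, param_grid, n_jobs=-1)
search.fit(X_digits, y_digits)

print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)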
Do you think we should proceed with the StandardScaler?
Also, what does MNT in the title mean? (Apologies for the incorrect title; I couldn't find MNT in the contributing docs.)
Do you think we should proceed with the StandardScaler?
Yes, when I tried it, it was working quite well. We might need to update the description if the number of hyperparameters changes.
MNT -> Maintenance
I tend to use DOC on these PRs, but no strong feelings.
LGTM
* Updated plot_digits_pipe
* Updated plot_digits_pipe with StandardScaler preprocessing
Reference Issues/PRs
#21598
What does this implement/fix? Explain your changes.
Timed using:
time -p python plot_digits_pipe.py
Time taken by original code (without the plotting part) -
Output:
After updating the code [reduced max iterations, increased tolerance, working with a subset] (without plotting) -
Output:
Any other comments?