[MRG] Speed up plot_digits_linkage.py example #21598 by yarkhinephyo · Pull Request #21678 · scikit-learn/scikit-learn · GitHub

[MRG] Speed up plot_digits_linkage.py example #21598 #21678


Merged
merged 6 commits into from
Nov 19, 2021

yarkhinephyo
Contributor
@yarkhinephyo yarkhinephyo commented Nov 15, 2021

Reference Issues/PRs

#21598

What does this implement/fix? Explain your changes.

Speeds up ../examples/cluster/plot_digits_linkage.py from 32 sec to 20 sec by reducing the number of digits dataset samples from 1800 to 800.

Additionally, increased the font size of the numbers and added a random state for manifold.SpectralEmbedding.
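A minimal sketch of the two changes described above, assuming the subset is taken by simply slicing the first 800 rows (the PR may select samples differently) and that `random_state=0` is an arbitrary illustrative seed. The commit list further down mentions removing a `random_state`, so the seed may not be part of the merged version.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import SpectralEmbedding

# Load the digits dataset (1797 samples, 64 pixel features each).
X, y = load_digits(return_X_y=True)

# Assumption: keep only the first 800 samples to shorten the runtime.
n_samples = 800
X, y = X[:n_samples], y[:n_samples]

# A fixed random_state makes the 2D embedding reproducible between runs.
# The seed value 0 is illustrative, not taken from the PR.
embedding = SpectralEmbedding(n_components=2, random_state=0)
X_red = embedding.fit_transform(X)

print(X_red.shape)  # (800, 2)
```

Fewer samples reduce both the embedding and the clustering cost, at the price of a sparser scatter plot.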

Before:
image

After:
image

Any other comments?

Nil

@adrinjalali adrinjalali mentioned this pull request Nov 15, 2021
41 tasks
Member
@adrinjalali adrinjalali left a comment


Otherwise LGTM

Member
@adrinjalali adrinjalali left a comment


LGTM. If you merge with the latest main, your CI should be green. Then we wait for a second reviewer to check the code :)

Member
@jmloyola jmloyola left a comment


Thanks for the PR @yarkhinephyo!

I left a couple of comments in the code 🤓. Let me know what you think.

Member
@ogrisel ogrisel left a comment


I think the main message of the example is quite clearly visible without the nudging data augmentation that also makes the code more complex for little benefit.

However the analysis could be improved to better reflect what we observe (both in main and in this branch). Let me suggest the following:

What this example shows us is the "rich getting richer" behavior of
agglomerative clustering, which tends to create uneven cluster sizes.

This behavior is pronounced for the average linkage strategy,
which ends up with a couple of clusters with few datapoints.

The case of single linkage is even more pathological, with a very
large cluster covering most digits, an intermediate size (clean)
cluster with most zero digits and all other clusters being drawn
from noise points around the fringes.

The other linkage strategies lead to more evenly distributed
clusters that are therefore likely to be less sensitive to a
random resampling of the dataset.
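One way to check the uneven cluster sizes described above is to count the members of each cluster per linkage strategy. The sketch below is illustrative only: it assumes 10 clusters, an 800-sample slice of the digits data, and clustering on the raw pixel features rather than on a 2D embedding, so the exact counts will differ from the example's figures.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_digits

# Assumption: first 800 digits, raw 64-dimensional pixel features.
X, y = load_digits(return_X_y=True)
X, y = X[:800], y[:800]

cluster_sizes = {}
for linkage in ("ward", "average", "complete", "single"):
    model = AgglomerativeClustering(n_clusters=10, linkage=linkage)
    labels = model.fit_predict(X)
    # Count the samples per cluster and sort from largest to smallest.
    sizes = np.sort(np.bincount(labels, minlength=10))[::-1]
    cluster_sizes[linkage] = sizes
    print(f"{linkage:>8}: {sizes.tolist()}")
```

A strategy that yields one dominant cluster and many near-empty ones shows up as a first count close to the sample total followed by very small counts.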

@adrinjalali adrinjalali merged commit 13480ff into scikit-learn:main Nov 19, 2021
@yarkhinephyo yarkhinephyo deleted the speed-up-plot-digit-linkage branch November 19, 2021 09:46
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Nov 22, 2021
…t-learn#21678)

* Reduce num of samples in plot-digit-linkage example

* Remove unnecessary random_state

* Remove nudge_images

* Address PR comment, elaborate analysis
@siavrez
Contributor
siavrez commented Nov 22, 2021

Changing the calls from matplotlib.pyplot.text to matplotlib.pyplot.scatter, with LaTeX markers for the digits, speeds up plotting and cuts the plotting runtime to 25% to 35% of the original time.
For this example, the runtime on my local system drops from 31.1 seconds to 4.8 seconds.
Linkage

@jmloyola
Member
jmloyola commented Nov 22, 2021

That's right, @siavrez. I've just tested it and it runs 17 times faster.

The original implementation runs slower because we used plt.text for each data point. Here, @siavrez uses plt.scatter with LaTeX markers for each class. PR #21737
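The idea can be sketched as follows: instead of one `plt.text` call per data point, issue one `scatter` call per digit class and use a mathtext marker such as `"$3$"` so the glyph itself is the marker. The random points, the marker size, and the colormap choice below are illustrative assumptions, not the code from PR #21737.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical stand-in data: 200 random 2D points with random digit labels.
rng = np.random.default_rng(0)
points = rng.random((200, 2))
digits = rng.integers(0, 10, size=200)

fig, ax = plt.subplots(figsize=(6, 4))
for digit in range(10):
    mask = digits == digit
    # One scatter call per class; the mathtext string renders the digit
    # glyph as the marker for every point of that class.
    ax.scatter(
        points[mask, 0],
        points[mask, 1],
        marker=f"${digit}$",
        s=50,
        color=plt.cm.nipy_spectral(digit / 10),
        alpha=0.7,
    )
ax.set_axis_off()
fig.canvas.draw()
```

This issues 10 drawing calls in total rather than one text artist per sample, which is the likely source of the speed-up reported above.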

What do you think @ogrisel, @adrinjalali?

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Nov 29, 2021
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Dec 24, 2021
glemaitre pushed a commit that referenced this pull request Dec 25, 2021