Increase speed plot_birch_vs_minibatchkmeans.py by Iglesys347 · Pull Request #21703 · scikit-learn/scikit-learn · GitHub

Increase speed plot_birch_vs_minibatchkmeans.py #21703

Merged 9 commits on Nov 19, 2021

Conversation

Iglesys347
Contributor

Reference Issues/PRs

References #21598

What does this implement/fix? Explain your changes.

Reduced the number of samples (n_samples) in the make_blobs function.

Also changed the batch_size in MiniBatchKMeans. The documentation of MiniBatchKMeans says: "For faster computations, you can set the batch_size greater than 256 * number of cores to enable parallelism on all cores."

The purpose of those changes is to increase the execution speed.
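A minimal sketch of the two changes described above, for readers who want to try them in isolation. The concrete values here (25,000 samples, 100 centers, `n_init=3`, and deriving the core count from `os.cpu_count()`) are assumptions for illustration; this description does not state the exact numbers used in the example script.

```python
import os

from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Assumed values for illustration only; not taken from the PR description.
N_SAMPLES = 25_000
N_CLUSTERS = 100
N_CORES = os.cpu_count() or 1

# Change 1: generate fewer samples so the example runs faster.
X, _ = make_blobs(n_samples=N_SAMPLES, centers=N_CLUSTERS, random_state=0)

# Change 2: scale batch_size with the number of cores. The MiniBatchKMeans
# docs suggest going above 256 * number of cores; this sketch uses that
# threshold as a baseline, and larger values could be tried as well.
mbk = MiniBatchKMeans(
    n_clusters=N_CLUSTERS,
    batch_size=256 * N_CORES,
    n_init=3,
    random_state=0,
)
mbk.fit(X)

print(X.shape, mbk.cluster_centers_.shape)
```

Whether a larger batch size actually helps is likely to depend on the machine, so it is worth timing both settings locally.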

Here are the output and the time taken by the script before the changes (the time taken has been measured with the unix command time):

BIRCH without global clustering as the final step took 3.27 seconds
n_clusters : 158
BIRCH with global clustering as the final step took 3.25 seconds
n_clusters : 100
Time taken to run MiniBatchKMeans 4.02 seconds

real    0m13,758s
user    0m14,533s
sys     0m1,728s

And here the resulting plot:

[Image: birch_fig1.png]

Now the output and the time taken by the script after the changes:

BIRCH without global clustering as the final step took 1.06 seconds
n_clusters : 158
BIRCH with global clustering as the final step took 1.07 seconds
n_clusters : 100
Time taken to run MiniBatchKMeans 0.86 seconds

real    0m6,032s
user    0m7,936s
sys     0m2,008s

And the plot:

[Image: birch_fig2]
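To make the before/after comparison easier to read, the reported timings can be turned into speedup ratios. The sketch below only reuses the numbers printed above; the dictionary keys are hypothetical labels chosen for this illustration. The wall-clock ("real") figure presumably also covers work outside the three fit calls, which would explain why it improves less than the individual steps.

```python
# Timings in seconds, copied from the two runs reported above.
before = {
    "birch_without_global": 3.27,
    "birch_with_global": 3.25,
    "minibatch_kmeans": 4.02,
    "wall_clock_real": 13.758,
}
after = {
    "birch_without_global": 1.06,
    "birch_with_global": 1.07,
    "minibatch_kmeans": 0.86,
    "wall_clock_real": 6.032,
}

# Speedup = old time / new time (values above 1 mean the new run is faster).
speedups = {name: before[name] / after[name] for name in before}

for name, ratio in speedups.items():
    print(f"{name}: {before[name]:.2f}s -> {after[name]:.2f}s ({ratio:.2f}x)")
```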

@ogrisel
Member
ogrisel commented Nov 18, 2021

I think the point of this example is to compare algorithms that have no problem running on datasets with hundreds of thousands (or even millions) of data points. I think it's fine to have an example that lasts ~10s to demonstrate the scalability of estimators on data with a largish number of samples.

Therefore I would rather not change this example. WDYT @adrinjalali?

@adrinjalali
Member

I kinda agree with you @ogrisel. But I'd also be happy with this change, plus a note saying that the number of samples can be extended to a few hundred thousand w/o any issue, without running that in the CI. WDYT?

@adrinjalali changed the title from "Increase speed plot birch vs minibatchkmeans" to "Increase speed plot_birch_vs_minibatchkmeans.py" on Nov 18, 2021
@adrinjalali adrinjalali mentioned this pull request Nov 18, 2021
41 tasks
Iglesys347 and others added 3 commits November 18, 2021 16:38
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
Member
@ogrisel ogrisel left a comment

LGTM! Let's wait for the CI to complete, just in case.

@adrinjalali
Member

@Iglesys347 could you please merge the latest main to make sure the docs are generated correctly?

Iglesys347 and others added 2 commits November 18, 2021 19:24
@Iglesys347
Contributor Author

@adrinjalali @ogrisel All good! Thank you both for your comments.

@ogrisel ogrisel merged commit 5856205 into scikit-learn:main Nov 19, 2021
@ogrisel
Member
ogrisel commented Nov 19, 2021

Runtime has been halved on the CI (from less than 8s to less than 4s). Thanks for the contribution.

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Nov 22, 2021
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Nov 29, 2021
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Dec 24, 2021
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
glemaitre pushed a commit that referenced this pull request Dec 25, 2021
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
3 participants