Increase speed plot_birch_vs_minibatchkmeans.py #21703
…dividing by 2 parameters n_samples, n_features, rank
I think the point of this example is to compare algorithms that have no problem running on datasets with hundreds of thousands (or even millions) of data points. I think it's fine to have an example that lasts ~10s to demonstrate the scalability of estimators on data with a large-ish number of samples. Therefore I would rather not change this example. WDYT @adrinjalali?
I kinda agree with you @ogrisel. But I'd also be happy with this change, plus a note saying that the number of samples can be extended to a few hundred thousand without any issue, while not running it at that size in the CI. WDYT?
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
LGTM! Let's wait for the CI to complete, just in case.
@Iglesys347 could you please merge the latest |
Retrieve latest changes
…se_speed_plot_birch_vs_minibatchkmeans
@adrinjalali @ogrisel All good! Thank you both for your comments.
Runtime has been halved on the CI (from less than 8s to less than 4s). Thanks for the contribution.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org> Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
Reference Issues/PRs
References #21598
What does this implement/fix? Explain your changes.
Reduced the number of samples (`n_samples`) in the `make_blobs` function. Also changed the `batch_size` in `MiniBatchKMeans`. The documentation of `MiniBatchKMeans` says: "For faster computations, you can set the batch_size greater than 256 * number of cores to enable parallelism on all cores."
The purpose of those changes is to increase the execution speed.
Here are the output and the time taken by the script before the changes (the time taken has been measured with the Unix command `time`):
And here is the resulting plot:
Now the output and the time taken by the script after the changes:
And the plot: