Increase speed plot_birch_vs_minibatchkmeans.py #21703
ogrisel merged 9 commits into scikit-learn:main from Iglesys347:increase_speed_plot_birch_vs_minibatchkmeans
…dividing by 2 parameters n_samples, n_features, rank
I think the point of this example is to compare algorithms that have no problem running on datasets with hundreds of thousands (or even millions) of data points. I think it's fine to have an example that lasts ~10s to demonstrate the scalability of estimators on data with a large-ish number of samples. Therefore I would rather not change this example. WDYT @adrinjalali?
I kinda agree with you @ogrisel. But I'd also be happy with this change, plus a note saying that the number of samples can be extended to a few hundred thousand w/o any issue, but not have it in the CI. WDYT?
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
@Iglesys347 could you please merge the latest
Retrieve latest changes
…se_speed_plot_birch_vs_minibatchkmeans
@adrinjalali @ogrisel All good! Thank you both for your comments.
Runtime has been halved on the CI (from less than 8s to less than 4s). Thanks for the contribution.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org> Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
Reference Issues/PRs
References #21598
What does this implement/fix? Explain your changes.
Reduced the number of samples (`n_samples`) in the `make_blobs` function.
Also changed the `batch_size` in `MiniBatchKMeans`. The documentation of `MiniBatchKMeans` says: "For faster computations, you can set the batch_size greater than 256 * number of cores to enable parallelism on all cores."
The purpose of those changes is to increase the execution speed.
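For illustration, here is a minimal sketch of the two knobs described above. It is not the diff from this PR: the sample count, the number of centers and clusters, and the factor of 2 on the batch size are assumed values chosen only to make the snippet run quickly.

```python
import os
from time import perf_counter

from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Assumed, illustrative values; the numbers used in the PR are not shown here.
n_samples = 10_000
n_clusters = 5

# Pick a batch size above 256 * number of cores, as the quoted
# MiniBatchKMeans documentation suggests for enabling parallelism.
n_cores = os.cpu_count() or 1
batch_size = 2 * 256 * n_cores

# Fewer samples means less data to generate and to cluster.
X, _ = make_blobs(n_samples=n_samples, centers=n_clusters, random_state=0)

mbk = MiniBatchKMeans(
    n_clusters=n_clusters,
    batch_size=batch_size,
    n_init=3,
    random_state=0,
)

t0 = perf_counter()
mbk.fit(X)
elapsed = perf_counter() - t0

print(f"batch_size={batch_size}, fit time: {elapsed:.2f}s")
```

Raising `n_samples` back to a few hundred thousand in this sketch is a quick way to check locally how the runtime scales.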
Here are the output and the time taken by the script before the changes (the time taken was measured with the Unix command `time`):
And here is the resulting plot:
Now the output and the time taken by the script after the changes:
And the plot: