[MRG] GaussianMixture with BIC/AIC #19562
base: main
Conversation
Thanks @tliu68: this looks like a good addition! 🙂
I've left a few comments and suggestions.
@jjerphan Thank you for your review! I have made some updates according to your comments. In particular, do you think the new class I added …
@tliu68 Thanks for your update!
The private class you have added makes the implementation easier to understand and extend later on. …
@jjerphan When you get a chance, do you mind taking a look at the new update? Thanks! BTW, we have been working on the error that is causing some checks to fail. However, it seems to be related to …
Thanks for the update, @tliu68. I am currently busier, and it will take some more days before I can get some time to review both the changes and the issue that you mention. If you know the commit this issue originates from, that would greatly help. 🙂
Sure! Actually, I don't think the part that is causing the issue has been changed in those commits. So maybe the latest one?
To clarify what we think is going on with the failing checks: one test in the estimator checks …
We observed that:
So, this ultimately seems like a problem that only comes up in … Advice on how to proceed is greatly appreciated!
Your report is right @tliu68. This problem is not linked to your PR and relates to the current Cython implementations of hierarchical clustering methods (here, MST-Linkage-Core) which do not support read-only buffers but should. As this problem has a bigger scope than your PR, I'll create an issue based on your snippet and inspect this problem. Thanks for your patience!
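For context, the read-only buffer problem above can be reproduced without joblib by flagging an array as non-writeable before fitting. This is a minimal sketch, not the exact snippet from the PR; whether `fit` raises depends on the scikit-learn version, since later releases accept read-only input:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Simulate joblib's memmapping behavior: large inputs are handed to
# workers as read-only buffers.
X = np.random.RandomState(0).rand(20, 3)
X.setflags(write=False)  # make the underlying buffer read-only

try:
    # On affected versions, the Cython MST-Linkage-Core code rejects
    # read-only buffers ("buffer source array is read-only").
    AgglomerativeClustering(n_clusters=2, linkage="ward").fit(X)
    outcome = "accepted read-only input"
except ValueError as exc:
    outcome = f"rejected read-only input: {exc}"
print(outcome)
```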
In the meantime, here are a few suggestions: we are nearly there! 🙂
Thanks @tliu68!
Though this is a usable method, I think it would be better integrated in scikit-learn-extra instead.
I guess we need to discuss it further with maintainers.
@@ -22,7 +22,7 @@
import matplotlib.pyplot as plt
Indeed, I've checked again and this is fixed.
@jjerphan We greatly appreciate your interest in our algorithm and your help in improving the code! We strongly believe that … This PR is essentially a Python port of R's mclust … Thank you again for your consideration!
Hi @tliu68 (and @tathey1, @bdpedigo and @PerifanosPrometheus), I do acknowledge that … To me, it is better for both maintenance and usability that scikit-learn focuses solely on the core algorithms for machine learning; this interface comes in handy to select a Gaussian mixture model based on a given criterion, but one could also create other interfaces to automatically select the most performant models based on some criteria. Nevertheless, this is my point of view, which is not that of a maintainer; both you and I are waiting for an authoritative one. :)
We do agree that … Thank you for your patience and all of your help, @jjerphan!
I gave this PR a pass and I have the feeling that this is too ad hoc to be part of scikit-learn under a new class.
From reading the discussion in #19338 I understand that there is a strong computational dependency between the new agglomerative clustering init schemes and the model selection procedure and the incremental regularization strategy.
However, I have the feeling that it would be too weird to have a GaussianMixtureIC class with significantly richer init capabilities than the underlying GaussianMixture class in scikit-learn. I also have the feeling that increasing the init capabilities of GaussianMixture would introduce too much maintenance burden and cognitive load for docstring readers for little gain compared to the flexibility of passing arbitrary component centers.
I also find it unexpected that the average GaussianMixtureIC training would be orders of magnitude slower than GaussianMixture, because the former is running automated model selection on the cartesian product of the hyperparameters of a rich model family.
I would therefore instead recommend publishing this implementation outside of scikit-learn, as an independently maintained Python package with released versions on PyPI (installable via pip) and conda-forge (installable with conda), along with its own documentation website with example usages, possibly using sphinx and sphinx-gallery, to foster adoption among the Python data science community.
I would also recommend to configure continuous integration to run the tests, including a test to run the check_estimator function to make sure that the code stays compatible with future scikit-learn versions.
Please find below some personal notes I made while reviewing the code.
cv_types = ["spherical", "tied", "diag", "full"]

# Fit Gaussian mixture with EM for a range of n_components and cv_types
gmIC = GaussianMixtureIC(
style:
- gmIC = GaussianMixtureIC(
+ gmic = GaussianMixtureIC(

or gm_ic if you prefer. The following lines will have to be updated accordingly.
# covariance type is not in ['spherical', 'diag', 'tied', 'full']
_test_inputs(X, ValueError, covariance_type="1")

# euclidean is not an affinity option when ward is a linkage option
- # euclidean is not an affinity option when ward is a linkage option
+ # manhattan is not an affinity option when the linkage is set to ward
)(
delayed(_fit_for_data)(ag_params, gm_params, seed)
for (ag_params, gm_params), seed in zip(param_grid, seeds)
)
Is using joblib-based multi-threading really beneficial here?
Have you done some benchmarks?
I would suspect that BLAS-level parallelism (in the underlying matrix-matrix operations) might already bring a multicore speed-up, and that additional Python-level multithreading might not bring any benefit and could potentially cause oversubscription issues.
It would be interesting to see some performance results on a machine with many cores (e.g. 8 or 16 cores) for various values of n_jobs, with and without parallel_kwargs = {"prefer": "threads"}.
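A benchmark harness for this comparison might look like the following sketch. `_fit_one` is a hypothetical stand-in for the PR's `_fit_for_data`; the point is only to compare the sequential path against joblib's thread-based backend on identical work:

```python
from joblib import Parallel, delayed
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

def _fit_one(n_components, seed):
    # hypothetical stand-in for _fit_for_data: fit one candidate model
    gm = GaussianMixture(n_components=n_components, random_state=seed).fit(X)
    return gm.bic(X)

params = [(k, 0) for k in range(1, 5)]

# sequential baseline vs. joblib thread-based parallelism;
# wrap each in time.perf_counter() calls to benchmark on a large machine
sequential = [_fit_one(k, s) for k, s in params]
threaded = Parallel(n_jobs=2, prefer="threads")(
    delayed(_fit_one)(k, s) for k, s in params
)
```

If BLAS threads already saturate the cores, the threaded version may show no speed-up (or a slowdown from oversubscription); limiting BLAS threads via threadpoolctl is one way to disentangle the two levels.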
criter = model.bic(X)
else:
criter = model.aic(X)
break

gm_params["reg_covar"] = reg_covar
# change the precision of "criter" based on sample size
criter = round(criter, int(np.log10(n_samples)))
results = _CollectResults(model, criter, gm_params, ag_params)
criter = model.bic(X)
else:
criter = model.aic(X)
break
gm_params["reg_covar"] = reg_covar
# change the precision of "criter" based on sample size
criter = round(criter, int(np.log10(n_samples)))
results = _CollectResults(model, criter, gm_params, ag_params)

Suggested change:

criterion_value = model.bic(X)
else:
criterion_value = model.aic(X)
break
gm_params["reg_covar"] = reg_covar
# change the precision of "criter" based on sample size
criterion_value = round(criterion_value, int(np.log10(n_samples)))
results = _CollectResults(model, criterion_value, gm_params, ag_params)
continue
except AssertionError:
reg_covar = _increase_reg(reg_covar)
continue
Could you please add 2 inline comments to explain when we expect ValueError and AssertionError to be raised, and why increasing the regularization is expected to help in each case?
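The escalation pattern under discussion can be sketched as follows. This is not the PR's code: `_increase_reg` here is a hypothetical escalation rule, and only ValueError is handled for brevity (scikit-learn raises it when a component's empirical covariance becomes ill-defined, e.g. singular):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

def _increase_reg(reg_covar):
    # hypothetical escalation rule: start small, then grow tenfold
    return 1e-6 if reg_covar == 0 else reg_covar * 10

def fit_with_reg_escalation(X, n_components, max_reg=1.0):
    reg_covar = 0
    while reg_covar <= max_reg:
        try:
            # ValueError: some components have ill-defined (near-singular)
            # empirical covariance for this reg_covar
            return GaussianMixture(
                n_components=n_components, reg_covar=reg_covar, random_state=0
            ).fit(X)
        except ValueError:
            # increasing reg_covar adds a larger constant to the diagonal
            # of each covariance matrix, keeping it positive definite
            reg_covar = _increase_reg(reg_covar)
    raise RuntimeError("could not fit model even with maximal regularization")

X, _ = make_blobs(n_samples=150, centers=2, random_state=0)
model = fit_with_reg_escalation(X, n_components=2)
```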
model = GaussianMixture(**gm_params)
model.reg_covar = reg_covar
This works, but I have the feeling that the following would be more natural:

- model = GaussianMixture(**gm_params)
- model.reg_covar = reg_covar
+ model = GaussianMixture(reg_covar=reg_covar, **gm_params)
criter = np.inf
# below is the regularization scheme
reg_covar = 0
while reg_covar <= 1 and criter == np.inf:
I don't understand this loop. Isn't reg_covar supposed to be increased even when there is no ValueError or AssertionError being raised?
model.reg_covar = reg_covar
try:
# ignoring warning here because if convergence is not reached,
# the regularization is automatically increased
By reading the code I don't understand how the fact that convergence is not reached is detected. I would have expected model.n_iter_ to be compared to self.max_iter, but this is not the case. Wouldn't it make sense to detect those cases and also call _increase_reg explicitly for them?
But maybe we should have a protection mechanism to avoid locking ourselves in an infinite while loop if max_iter is too small and prevents convergence whatever the value of reg_covar.
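Detecting non-convergence explicitly might look like the sketch below, using the public `converged_` and `n_iter_` attributes of a fitted GaussianMixture (the variable name `needs_escalation` is illustrative):

```python
import warnings
from sklearn.datasets import make_blobs
from sklearn.exceptions import ConvergenceWarning
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

with warnings.catch_warnings():
    # a deliberately tiny max_iter so EM cannot converge
    warnings.simplefilter("ignore", ConvergenceWarning)
    gm = GaussianMixture(n_components=3, max_iter=1, random_state=0).fit(X)

# explicit check: EM exhausted its iteration budget without converging,
# so the caller could _increase_reg (up to some bound) or warn
needs_escalation = (not gm.converged_) and gm.n_iter_ == gm.max_iter
```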
self.linkage_ = results[best_idx].linkage
self.reg_covar_ = results[best_idx].reg_covar
self.best_model_ = results[best_idx].model
self.n_iter_ = results[best_idx].model.n_iter_
I think that if we still have self.n_iter_ == self.max_iter at this point, we should raise a ConvergenceWarning explicitly here.
Reference Issues/PRs
Adds the implementation discussed in #19338
What does this implement/fix? Explain your changes.
It automatically selects the best GM model based on BIC or AIC among a set of models that are parameterized by … (as in mclust, see below).

Any other comments?
This work is inspired by mclust (a popular package in R for GM modeling) and its performance is evaluated on synthetic and real data (Athey et al., 2020). This notebook roughly replicates the clustering experiments in this tutorial and also includes the GaussianMixtureIC algorithm proposed here.