[MRG] GaussianMixture with BIC/AIC by tliu68 · Pull Request #19562 · scikit-learn/scikit-learn

Open · wants to merge 51 commits into base: main

Conversation

@tliu68 (Contributor) commented on Feb 25, 2021

Reference Issues/PRs

Adds the implementation discussed in #19338

What does this implement/fix? Explain your changes.

It automatically selects the best Gaussian mixture (GM) model based on BIC or AIC among a set of candidate models parameterized by the following (a minimal sketch of the equivalent manual sweep follows the list):

  • Initialization scheme, which can be random, k-means, or agglomerative clustering (as done in mclust, see below)
  • Covariance constraints
  • Number of components
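
For context, here is a minimal sketch of the manual sweep that this class automates, using only the existing public GaussianMixture API (the agglomerative initialization scheme is omitted, and the data is a placeholder):

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.RandomState(0).randn(500, 2)  # placeholder data

# Sweep covariance constraints and number of components, keeping the
# model with the lowest BIC.
best_bic, best_model = np.inf, None
for covariance_type in ("spherical", "tied", "diag", "full"):
    for n_components in range(1, 7):
        gm = GaussianMixture(
            n_components=n_components,
            covariance_type=covariance_type,
            random_state=0,
        ).fit(X)
        bic = gm.bic(X)
        if bic < best_bic:
            best_bic, best_model = bic, gm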

Any other comments?

This work is inspired by mclust (a popular R package for GM modeling), and its performance is evaluated on synthetic and real data (Athey et al., 2020).

This notebook roughly replicates the clustering experiments in this tutorial and also includes the GaussianMixtureIC algorithm proposed here.

@tliu68 marked this pull request as draft on February 25, 2021
@jjerphan (Member) left a comment

Thanks @tliu68: this looks like a good addition! 🙂
I've left a few comments and suggestions.

@tliu68 (Contributor, Author) commented on Feb 28, 2021

@jjerphan Thank you for your review! I have made some updates according to your comments. In particular, do you think the new class I added, _CollectResults, is appropriate?

@jjerphan (Member) commented

@tliu68 Thanks for your update!

> In particular, do you think the new class I added, _CollectResults, is appropriate?

The private class you have added makes the implementation easier to understand and extend later on.

@tliu68 (Contributor, Author) commented on Mar 10, 2021

@jjerphan When you get a chance, do you mind taking a look at the new update? Thanks!

BTW, we have been working on the error that is causing some checks to fail. However, it seems to be related to AgglomerativeClustering itself, and I am not sure how to fix it. Basically, when check_clustering was run on GaussianMixtureIC, some clustering parameters, among all the valid combinations that GaussianMixtureIC considered, triggered a ValueError: buffer source array is read-only. Interestingly, running the same check on AgglomerativeClustering does not trigger the error, probably because the default parameters for AgglomerativeClustering happen not to be problematic. Please see this gist for details.

@jjerphan (Member) commented
Thanks for the update, @tliu68.

I am currently busier, and it will take some more days before I can find time to review both the changes and the issue that you mention. If you know the commit this issue originates from, that would help greatly. 🙂

@tliu68 (Contributor, Author) commented on Mar 11, 2021

> Thanks for the update, @tliu68.
>
> I am currently busier, and it will take some more days before I can find time to review both the changes and the issue that you mention. If you know the commit this issue originates from, that would help greatly. 🙂

Sure! Actually, I don't think the part that is causing the issue has been changed in those commits. So maybe the latest one?

@tliu68 (Contributor, Author) commented on Mar 16, 2021

To clarify what we think is going on with the failing checks: one of the estimator checks, check_clustering(readonly_memmap=True), triggered the following error at the line agg.fit(X_subset) in our code:

ValueError: buffer source array is read-only

We observed that:

  1. The error was triggered when running with some specific parameters, e.g. AgglomerativeClustering(affinity="euclidean", linkage="single").
  2. Testing AgglomerativeClustering with the estimator checks succeeds. However, in the check_clustering test, AgglomerativeClustering is only initialized with its default parameters, AgglomerativeClustering(affinity="euclidean", linkage="ward") (included in the gist). Therefore the failing case is never actually exercised by the current sklearn test suite.
  3. Our code only triggers this error because we sweep over agglomerative initialization schemes, including combinations such as (affinity="euclidean", linkage="single"). If the current test for AgglomerativeClustering also tested these parameters, it would fail too.
  4. The test only fails when using memmap data, i.e. the error is gone when the line X, y = create_memmap_backed_data([X, y]) in the gist is deleted.

So, this ultimately seems like a problem that only comes up in AgglomerativeClustering with the combination of memmap data and certain choices of affinity and linkage (such as "euclidean" and "single").
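
For reference, a minimal sketch of the reproduction described above, distilled from the linked gist (create_memmap_backed_data is the test helper from sklearn.utils._testing):

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.utils._testing import create_memmap_backed_data

X = np.random.RandomState(0).randn(50, 3)
X = create_memmap_backed_data(X)  # read-only memmap, as in the estimator checks

# The defaults (linkage="ward") pass, but linkage="single" raises
# "ValueError: buffer source array is read-only" at fit time.
agg = AgglomerativeClustering(affinity="euclidean", linkage="single")
agg.fit(X)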

Advice on how to proceed is greatly appreciated!

@jovo commented on Apr 12, 2021

@jjerphan Are you waiting on @tliu68 for anything?

@jjerphan (Member) commented

@jovo: I am not waiting on @tliu68 for anything; I am getting back to this PR.

@jjerphan (Member) commented

Your report is right, @tliu68. This problem is not linked to your PR; it relates to the current Cython implementations of the hierarchical clustering methods (here, MST-Linkage-Core), which do not support read-only buffers but should.

As this problem has a bigger scope than your PR, I'll create an issue based on your snippet and inspect it.
Then we should be able to get your changes integrated.

Thanks for your patience!

@jjerphan (Member) left a comment
In the meantime, here are a few suggestions: we are nearly there! 🙂

@jjerphan (Member) left a comment

Thanks @tliu68!

Though this is a usable method, I think it would be better integrated in scikit-learn-extra instead.

I guess we need to discuss it further with maintainers.

@@ -22,7 +22,7 @@
import matplotlib.pyplot as plt

Indeed, I've checked again and this is fixed.

@tliu68 (Contributor, Author) commented on Aug 14, 2021

@jjerphan We greatly appreciate your interest in our algorithm and your help in improving the code! We strongly believe that scikit-learn is the right place for this PR.

This PR is essentially a Python port of R's mclust, the de facto standard method for model-based clustering in the statistics community. It was first developed in 1993 (Model-based Gaussian and non-Gaussian clustering) and has been frequently updated since then (in 1998, 2002, 2003, 2006, 2012, 2014, and 2016), yielding more than 11K citations over nearly 30 years. It is also one of the most popular R packages ever, with ~110k downloads just this month, and it is one of the 5 core clustering packages in R.

Thank you again for your consideration!

@jjerphan (Member) commented
Hi @tliu68 (and @tathey1, @bdpedigo and @PerifanosPrometheus),

I do acknowledge that mclust is popular as a library; however, I cannot judge whether it is relevant to integrate this method into scikit-learn.

To me, it is better for both maintenance and usability that scikit-learn focuses solely on the core algorithms for machine learning; this interface comes in handy for selecting a Gaussian mixture model based on a given criterion, but one could also create other interfaces to automatically select the best-performing models based on other criteria.

Nevertheless, this is my point of view, not that of a maintainer; both you and I are waiting for an authoritative one. :)

@jjerphan requested a review from glemaitre on September 2, 2021
@tliu68 (Contributor, Author) commented on Sep 7, 2021

We do agree that scikit-learn focuses on the core algorithms, but using BIC or AIC for GMM model selection is, in our opinion, a standard procedure. Indeed, one can opt for other criteria for a GM model, but, similar to LassoLarsIC, we believe we are simplifying one of the most popular model selection procedures for GMMs and expanding the list of information-criterion-based implementations available in scikit-learn (updated in a recent commit).
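
(For reference, a minimal sketch of the LassoLarsIC analogy on synthetic data, using only the public scikit-learn API:)

import numpy as np
from sklearn.linear_model import LassoLarsIC

rng = np.random.RandomState(0)
X = rng.randn(100, 10)
y = X[:, 0] + 0.1 * rng.randn(100)

# LassoLarsIC picks the regularization strength minimizing the chosen
# information criterion, as GaussianMixtureIC does for GMM hyperparameters.
reg = LassoLarsIC(criterion="bic").fit(X, y)
print(reg.alpha_)  # selected regularization strength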

Thank you for your patience and all of your help @jjerphan !

@jjerphan added the Needs Decision label on Sep 7, 2021
@ogrisel (Member) left a comment

I gave this PR a pass, and I have the feeling that it is too ad hoc to be part of scikit-learn as a new class.

From reading the discussion in #19338, I understand that there is a strong computational dependency between the new agglomerative clustering init schemes, the model selection procedure, and the incremental regularization strategy.

However, I have the feeling that it would be too weird to have a GaussianMixtureIC class with significantly richer init capabilities than the underlying GaussianMixture class in scikit-learn. I also have the feeling that increasing the init capabilities of GaussianMixture would introduce too much maintenance burden and cognitive load for docstring readers, for little gain compared to the flexibility of passing arbitrary component centers.

I also find it unexpected that an average GaussianMixtureIC training run would be orders of magnitude slower than GaussianMixture, because the former runs automated model selection on the Cartesian product of the hyperparameters of a rich model family.

I would therefore instead recommend publishing this implementation outside of scikit-learn, as an independently maintained Python package, with released versions on PyPI (installable via pip) and conda-forge (installable via conda), along with its own documentation website with usage examples, possibly built with sphinx and sphinx-gallery, to foster adoption among the Python data science community.

I would also recommend configuring continuous integration to run the tests, including a test that runs the check_estimator function to make sure the code stays compatible with future scikit-learn versions.
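
Such a test could look like the following minimal sketch (the import path for the externally published GaussianMixtureIC is hypothetical; check_estimator is part of the public scikit-learn API):

from sklearn.utils.estimator_checks import check_estimator
from gmm_ic import GaussianMixtureIC  # hypothetical package name

def test_sklearn_compatibility():
    # Runs scikit-learn's full estimator check suite and raises on any
    # API-compatibility failure.
    check_estimator(GaussianMixtureIC())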

Please find below some personal notes I made while reviewing the code.

cv_types = ["spherical", "tied", "diag", "full"]

# Fit Gaussian mixture with EM for a range of n_components and cv_types
gmIC = GaussianMixtureIC(

style:

Suggested change:
- gmIC = GaussianMixtureIC(
+ gmic = GaussianMixtureIC(

or gm_ic if you prefer. The following lines will have to be updated accordingly.

# covariance type is not in ['spherical', 'diag', 'tied', 'full']
_test_inputs(X, ValueError, covariance_type="1")

# euclidean is not an affinity option when ward is a linkage option

Suggested change:
- # euclidean is not an affinity option when ward is a linkage option
+ # manhattan is not an affinity option when the linkage is set to ward

)(
delayed(_fit_for_data)(ag_params, gm_params, seed)
for (ag_params, gm_params), seed in zip(param_grid, seeds)
)

Is using joblib-based multi-threading really beneficial here?

Have you done some benchmarks?

I would suspect that BLAS-level parallelism (in the underlying matrix-matrix operations) might already bring multicore speed-up, and that additional Python-level multithreading might not bring any benefit and could potentially cause oversubscription issues.

It would be interesting to see some performance results on a machine with many cores (e.g. 8 or 16 cores) for various values of n_jobs and with and without parallel_kwargs = {"prefer": "threads"}.
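
A rough sketch of such a benchmark, assuming GaussianMixtureIC exposes an n_jobs parameter (as the joblib-based code above implies):

import time
import numpy as np
from sklearn.mixture import GaussianMixtureIC  # as proposed in this PR

X = np.random.RandomState(0).randn(10_000, 8)

# Time a full fit for several degrees of joblib parallelism; repeat with
# and without parallel_kwargs = {"prefer": "threads"} to compare.
for n_jobs in (1, 2, 4, 8, 16):
    tic = time.perf_counter()
    GaussianMixtureIC(n_jobs=n_jobs).fit(X)
    print(f"n_jobs={n_jobs}: {time.perf_counter() - tic:.2f}s")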

Comment on lines +443 to +451
criter = model.bic(X)
else:
criter = model.aic(X)
break

gm_params["reg_covar"] = reg_covar
# change the precision of "criter" based on sample size
criter = round(criter, int(np.log10(n_samples)))
results = _CollectResults(model, criter, gm_params, ag_params)

Suggested change:
- criter = model.bic(X)
- else:
- criter = model.aic(X)
- break
- gm_params["reg_covar"] = reg_covar
- # change the precision of "criter" based on sample size
- criter = round(criter, int(np.log10(n_samples)))
- results = _CollectResults(model, criter, gm_params, ag_params)
+ criterion_value = model.bic(X)
+ else:
+ criterion_value = model.aic(X)
+ break
+ gm_params["reg_covar"] = reg_covar
+ # change the precision of "criterion_value" based on sample size
+ criterion_value = round(criterion_value, int(np.log10(n_samples)))
+ results = _CollectResults(model, criterion_value, gm_params, ag_params)

continue
except AssertionError:
reg_covar = _increase_reg(reg_covar)
continue

Could you please add two inline comments to explain when we expect ValueError and AssertionError to be raised, and why increasing the regularization is expected to help in each case?
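
A sketch of the kind of comments being requested, against the except blocks quoted above (the ValueError case matches GaussianMixture's documented failure mode for ill-defined covariances; the source of the AssertionError is exactly what the comment should pin down):

except ValueError:
    # GaussianMixture raises ValueError when some components have an
    # ill-defined empirical covariance (e.g. singleton or collapsed
    # clusters); a larger reg_covar keeps the covariance well-conditioned.
    reg_covar = _increase_reg(reg_covar)
    continue
except AssertionError:
    # (Document here which internal check raises AssertionError;
    # increasing reg_covar is assumed to help for the same
    # numerical-conditioning reason.)
    reg_covar = _increase_reg(reg_covar)
    continue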

Comment on lines +419 to +420
model = GaussianMixture(**gm_params)
model.reg_covar = reg_covar

This works, but I have the feeling that the following would be more natural:

Suggested change:
- model = GaussianMixture(**gm_params)
- model.reg_covar = reg_covar
+ model = GaussianMixture(reg_covar=reg_covar, **gm_params)

criter = np.inf
# below is the regularization scheme
reg_covar = 0
while reg_covar <= 1 and criter == np.inf:

I don't understand this loop. Isn't reg_covar supposed to be increased even when no ValueError or AssertionError is raised?

model.reg_covar = reg_covar
try:
# ignoring warning here because if convergence is not reached,
# the regularization is automatically increased

From reading the code, I don't understand how it is detected that convergence was not reached. I would have expected model.n_iter_ to be compared to self.max_iter, but this is not the case.

Wouldn't it make sense to detect those cases and also call _increase_reg explicitly for them?

But maybe we should have a protection mechanism to avoid locking ourselves in an infinite while loop if max_iter is too small and prevents convergence whatever the value of reg_covar.
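
One possible sketch of the explicit detection, using the converged_ attribute that GaussianMixture sets after fit (_increase_reg is the PR's own helper):

model.fit(X)
if not model.converged_:
    # fit stopped at max_iter without meeting the tolerance; treat this
    # like the exceptional cases above and increase the regularization,
    # with some cap on reg_covar to avoid an infinite loop.
    reg_covar = _increase_reg(reg_covar)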

self.linkage_ = results[best_idx].linkage
self.reg_covar_ = results[best_idx].reg_covar
self.best_model_ = results[best_idx].model
self.n_iter_ = results[best_idx].model.n_iter_

I think that if we still have self.n_iter_ == self.max_iter at this point, we should raise a ConvergenceWarning explicitly here.
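
For instance, a minimal sketch (ConvergenceWarning lives in sklearn.exceptions):

import warnings
from sklearn.exceptions import ConvergenceWarning

if self.n_iter_ == self.max_iter:
    warnings.warn(
        "The selected model reached max_iter without converging; consider "
        "increasing max_iter or reg_covar.",
        ConvergenceWarning,
    )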

@glemaitre removed their request for review on December 16, 2021
@cmarmo added the Needs Decision - Include Feature label and removed the Waiting for Reviewer and Needs Decision labels on May 9, 2022
Labels: module:mixture, Needs Decision - Include Feature
6 participants