[MRG] GaussianMixture with BIC/AIC by tliu68 · Pull Request #19562 · scikit-learn/scikit-learn

Open · wants to merge 51 commits into base: main

Conversation

@tliu68 (Contributor) commented on Feb 25, 2021

Reference Issues/PRs

Adds the implementation discussed in #19338

What does this implement/fix? Explain your changes.

It automatically selects the best Gaussian mixture (GM) model based on BIC or AIC among a set of candidate models parameterized by the following (a minimal sketch of the equivalent manual sweep follows the list):

  • Initialization scheme, which can be random, k-means, or agglomerative clustering (as done in mclust, see below)
  • Covariance constraints
  • Number of components
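
For context, here is a minimal sketch of the manual sweep that this class automates, using only the existing public GaussianMixture API (the agglomerative initialization scheme is omitted, and the data is a placeholder):

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.RandomState(0).randn(500, 2)  # placeholder data

# Sweep covariance constraints and number of components, keeping the
# model with the lowest BIC.
best_bic, best_model = np.inf, None
for covariance_type in ("spherical", "tied", "diag", "full"):
    for n_components in range(1, 7):
        gm = GaussianMixture(
            n_components=n_components,
            covariance_type=covariance_type,
            random_state=0,
        ).fit(X)
        bic = gm.bic(X)
        if bic < best_bic:
            best_bic, best_model = bic, gm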

Any other comments?

This work is inspired by mclust (a popular R package for GM modeling), and its performance is evaluated on synthetic and real data (Athey et al., 2020).

This notebook roughly replicates the clustering experiments in this tutorial and also includes the GaussianMixtureIC algorithm proposed here.

@tliu68 marked this pull request as draft on February 25, 2021
@jjerphan (Member) left a comment

Thanks @tliu68: this looks like a good addition! 🙂
I've left a few comments and suggestions.

@tliu68 (Contributor, Author) commented on Feb 28, 2021

@jjerphan Thank you for your review! I have made some updates according to your comments. In particular, do you think the new class I added, _CollectResults, is appropriate?

@jjerphan (Member) commented

@tliu68 Thanks for your update!

> In particular, do you think the new class I added, _CollectResults, is appropriate?

The private class you have added makes the implementation easier to understand and extend later on.

@tliu68 (Contributor, Author) commented on Mar 10, 2021

@jjerphan When you get a chance, do you mind taking a look at the new update? Thanks!

BTW, we have been working on the error that is causing some checks to fail. However, it seems to be related to AgglomerativeClustering itself, and I am not sure how to fix it. Basically, when check_clustering was run on GaussianMixtureIC, some clustering parameters, among all the valid combinations that GaussianMixtureIC considered, triggered a ValueError: buffer source array is read-only. Interestingly, running the same check on AgglomerativeClustering does not trigger the error, probably because the default parameters for AgglomerativeClustering happen not to be problematic. Please see this gist for details.

@jjerphan (Member) commented
Thanks for the update, @tliu68.

I am currently busier, and it will take some more days before I can find time to review both the changes and the issue that you mention. If you know the commit this issue originates from, that would help greatly. 🙂

@tliu68 (Contributor, Author) commented on Mar 11, 2021

> Thanks for the update, @tliu68.
>
> I am currently busier, and it will take some more days before I can find time to review both the changes and the issue that you mention. If you know the commit this issue originates from, that would help greatly. 🙂

Sure! Actually, I don't think the part that is causing the issue has been changed in those commits. So maybe the latest one?

@tliu68 (Contributor, Author) commented on Mar 16, 2021

To clarify what we think is going on with the failing checks: one of the estimator checks, check_clustering(readonly_memmap=True), triggered the following error at the line agg.fit(X_subset) in our code:

ValueError: buffer source array is read-only

We observed that:

  1. The error was triggered when running with some specific parameters, e.g. AgglomerativeClustering(affinity="euclidean", linkage="single").
  2. Testing AgglomerativeClustering with the estimator checks succeeds. However, in the check_clustering test, AgglomerativeClustering is only initialized with its default parameters, AgglomerativeClustering(affinity="euclidean", linkage="ward") (included in the gist). Therefore the failing case is never actually exercised by the current sklearn test suite.
  3. Our code only triggers this error because we sweep over agglomerative initialization schemes, including combinations such as (affinity="euclidean", linkage="single"). If the current test for AgglomerativeClustering also tested these parameters, it would fail too.
  4. The test only fails when using memmap data, i.e. the error is gone when the line X, y = create_memmap_backed_data([X, y]) in the gist is deleted.

So, this ultimately seems like a problem that only comes up in AgglomerativeClustering with the combination of memmap data and certain choices of affinity and linkage (such as "euclidean" and "single").
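
For reference, a minimal sketch of the reproduction described above, distilled from the linked gist (create_memmap_backed_data is the test helper from sklearn.utils._testing):

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.utils._testing import create_memmap_backed_data

X = np.random.RandomState(0).randn(50, 3)
X = create_memmap_backed_data(X)  # read-only memmap, as in the estimator checks

# The defaults (linkage="ward") pass, but linkage="single" raises
# "ValueError: buffer source array is read-only" at fit time.
agg = AgglomerativeClustering(affinity="euclidean", linkage="single")
agg.fit(X)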

Advice on how to proceed is greatly appreciated!

@jovo commented on Apr 12, 2021

@jjerphan Are you waiting on @tliu68 for anything?

@jjerphan (Member) commented

@jovo: I am not waiting on @tliu68 for anything; I am getting back to this PR.

@jjerphan (Member) commented

Your report is right, @tliu68. This problem is not linked to your PR; it relates to the current Cython implementations of the hierarchical clustering methods (here, MST-Linkage-Core), which do not support read-only buffers but should.

As this problem has a bigger scope than your PR, I'll create an issue based on your snippet and inspect it.
Then we should be able to get your changes integrated.

Thanks for your patience!

@jjerphan (Member) left a comment
In the meantime, here are a few suggestions: we are nearly there! 🙂

@jjerphan (Member) left a comment

Thanks @tliu68!

Though this is a usable method, I think it would be better integrated in scikit-learn-extra instead.

I guess we need to discuss it further with maintainers.

@@ -22,7 +22,7 @@
import matplotlib.pyplot as plt

Indeed, I've checked again and this is fixed.

@tliu68 (Contributor, Author) commented on Aug 14, 2021

@jjerphan We greatly appreciate your interest in our algorithm and your help in improving the code! We strongly believe that scikit-learn is the right place for this PR.

This PR is essentially a Python port of R's mclust, the de facto standard method for model-based clustering in the statistics community. It was first developed in 1993 (Model-based Gaussian and non-Gaussian clustering) and has been frequently updated since then (in 1998, 2002, 2003, 2006, 2012, 2014, and 2016), yielding more than 11K citations over nearly 30 years. It is also one of the most popular R packages ever, with ~110k downloads just this month, and it is one of the 5 core clustering packages in R.

Thank you again for your consideration!

@jjerphan (Member) commented
Hi @tliu68 (and @tathey1, @bdpedigo and @PerifanosPrometheus),

I do acknowledge that mclust is popular as a library; however, I cannot judge whether it is relevant to integrate this method into scikit-learn.

To me, it is better for both maintenance and usability that scikit-learn focuses solely on the core algorithms for machine learning; this interface comes in handy for selecting a Gaussian mixture model based on a given criterion, but one could also create other interfaces to automatically select the best-performing models based on other criteria.

Nevertheless, this is my point of view, not that of a maintainer; both you and I are waiting for an authoritative one. :)

@jjerphan requested a review from glemaitre on September 2, 2021
@tliu68 (Contributor, Author) commented on Sep 7, 2021

We do agree that scikit-learn focuses on the core algorithms, but using BIC or AIC for GMM model selection is, in our opinion, a standard procedure. Indeed, one can opt for other criteria for a GM model, but, similar to LassoLarsIC, we believe we are simplifying one of the most popular model selection procedures for GMMs and expanding the list of information-criterion-based implementations available in scikit-learn (updated in a recent commit).
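
(For reference, a minimal sketch of the LassoLarsIC analogy on synthetic data, using only the public scikit-learn API:)

import numpy as np
from sklearn.linear_model import LassoLarsIC

rng = np.random.RandomState(0)
X = rng.randn(100, 10)
y = X[:, 0] + 0.1 * rng.randn(100)

# LassoLarsIC picks the regularization strength minimizing the chosen
# information criterion, as GaussianMixtureIC does for GMM hyperparameters.
reg = LassoLarsIC(criterion="bic").fit(X, y)
print(reg.alpha_)  # selected regularization strength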

Thank you for your patience and all of your help @jjerphan !

@jjerphan added the Needs Decision label on Sep 7, 2021
@ogrisel (Member) left a comment

I gave this PR a pass, and I have the feeling that it is too ad hoc to be part of scikit-learn as a new class.

From reading the discussion in #19338, I understand that there is a strong computational dependency between the new agglomerative clustering init schemes, the model selection procedure, and the incremental regularization strategy.

However, I have the feeling that it would be too weird to have a GaussianMixtureIC class with significantly richer init capabilities than the underlying GaussianMixture class in scikit-learn. I also have the feeling that increasing the init capabilities of GaussianMixture would introduce too much maintenance burden and cognitive load for docstring readers, for little gain compared to the flexibility of passing arbitrary component centers.

I also find it unexpected that an average GaussianMixtureIC training run would be orders of magnitude slower than GaussianMixture, because the former runs automated model selection on the Cartesian product of the hyperparameters of a rich model family.

I would therefore instead recommend publishing this implementation outside of scikit-learn, as an independently maintained Python package, with released versions on PyPI (installable via pip) and conda-forge (installable via conda), along with its own documentation website with usage examples, possibly built with sphinx and sphinx-gallery, to foster adoption among the Python data science community.

I would also recommend configuring continuous integration to run the tests, including a test that runs the check_estimator function to make sure the code stays compatible with future scikit-learn versions.
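
Such a test could look like the following minimal sketch (the import path for the externally published GaussianMixtureIC is hypothetical; check_estimator is part of the public scikit-learn API):

from sklearn.utils.estimator_checks import check_estimator
from gmm_ic import GaussianMixtureIC  # hypothetical package name

def test_sklearn_compatibility():
    # Runs scikit-learn's full estimator check suite and raises on any
    # API-compatibility failure.
    check_estimator(GaussianMixtureIC())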

Please find below some personal notes I made while reviewing the code.

cv_types = ["spherical", "tied", "diag", "full"]

# Fit Gaussian mixture with EM for a range of n_components and cv_types
gmIC = GaussianMixtureIC(

style:

Suggested change:
- gmIC = GaussianMixtureIC(
+ gmic = GaussianMixtureIC(

or gm_ic if you prefer. The following lines will have to be updated accordingly.

# covariance type is not in ['spherical', 'diag', 'tied', 'full']
_test_inputs(X, ValueError, covariance_type="1")

# euclidean is not an affinity option when ward is a linkage option

Suggested change:
- # euclidean is not an affinity option when ward is a linkage option
+ # manhattan is not an affinity option when the linkage is set to ward

)(
delayed(_fit_for_data)(ag_params, gm_params, seed)
for (ag_params, gm_params), seed in zip(param_grid, seeds)
)

Is using joblib-based multi-threading really beneficial here?

Have you done some benchmarks?

I would suspect that BLAS-level parallelism (in the underlying matrix-matrix operations) might already bring multicore speed-up, and that additional Python-level multithreading might not bring any benefit and could potentially cause oversubscription issues.

It would be interesting to see some performance results on a machine with many cores (e.g. 8 or 16 cores) for various values of n_jobs and with and without parallel_kwargs = {"prefer": "threads"}.
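
A rough sketch of such a benchmark, assuming GaussianMixtureIC exposes an n_jobs parameter (as the joblib-based code above implies):

import time
import numpy as np
from sklearn.mixture import GaussianMixtureIC  # as proposed in this PR

X = np.random.RandomState(0).randn(10_000, 8)

# Time a full fit for several degrees of joblib parallelism; repeat with
# and without parallel_kwargs = {"prefer": "threads"} to compare.
for n_jobs in (1, 2, 4, 8, 16):
    tic = time.perf_counter()
    GaussianMixtureIC(n_jobs=n_jobs).fit(X)
    print(f"n_jobs={n_jobs}: {time.perf_counter() - tic:.2f}s")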

Comment on lines +443 to +451
criter = model.bic(X)
else:
criter = model.aic(X)
break

gm_params["reg_covar"] = reg_covar
# change the precision of "criter" based on sample size
criter = round(criter, int(np.log10(n_samples)))
results = _CollectResults(model, criter, gm_params, ag_params)

Suggested change:
- criter = model.bic(X)
- else:
- criter = model.aic(X)
- break
- gm_params["reg_covar"] = reg_covar
- # change the precision of "criter" based on sample size
- criter = round(criter, int(np.log10(n_samples)))
- results = _CollectResults(model, criter, gm_params, ag_params)
+ criterion_value = model.bic(X)
+ else:
+ criterion_value = model.aic(X)
+ break
+ gm_params["reg_covar"] = reg_covar
+ # change the precision of "criterion_value" based on sample size
+ criterion_value = round(criterion_value, int(np.log10(n_samples)))
+ results = _CollectResults(model, criterion_value, gm_params, ag_params)

continue
except AssertionError:
reg_covar = _increase_reg(reg_covar)
continue

Could you please add two inline comments to explain when we expect ValueError and AssertionError to be raised, and why increasing the regularization is expected to help in each case?
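
A sketch of the kind of comments being requested, against the except blocks quoted above (the ValueError case matches GaussianMixture's documented failure mode for ill-defined covariances; the source of the AssertionError is exactly what the comment should pin down):

except ValueError:
    # GaussianMixture raises ValueError when some components have an
    # ill-defined empirical covariance (e.g. singleton or collapsed
    # clusters); a larger reg_covar keeps the covariance well-conditioned.
    reg_covar = _increase_reg(reg_covar)
    continue
except AssertionError:
    # (Document here which internal check raises AssertionError;
    # increasing reg_covar is assumed to help for the same
    # numerical-conditioning reason.)
    reg_covar = _increase_reg(reg_covar)
    continue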

Comment on lines +419 to +420
model = GaussianMixture(**gm_params)
model.reg_covar = reg_covar

This works, but I have the feeling that the following would be more natural:

Suggested change:
- model = GaussianMixture(**gm_params)
- model.reg_covar = reg_covar
+ model = GaussianMixture(reg_covar=reg_covar, **gm_params)

criter = np.inf
# below is the regularization scheme
reg_covar = 0
while reg_covar <= 1 and criter == np.inf:

I don't understand this loop. Isn't reg_covar supposed to be increased even when no ValueError or AssertionError is raised?

model.reg_covar = reg_covar
try:
# ignoring warning here because if convergence is not reached,
# the regularization is automatically increased

From reading the code, I don't understand how it is detected that convergence was not reached. I would have expected model.n_iter_ to be compared to self.max_iter, but this is not the case.

Wouldn't it make sense to detect those cases and also call _increase_reg explicitly for them?

But maybe we should have a protection mechanism to avoid locking ourselves in an infinite while loop if max_iter is too small and prevents convergence whatever the value of reg_covar.
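
One possible sketch of the explicit detection, using the converged_ attribute that GaussianMixture sets after fit (_increase_reg is the PR's own helper):

model.fit(X)
if not model.converged_:
    # fit stopped at max_iter without meeting the tolerance; treat this
    # like the exceptional cases above and increase the regularization,
    # with some cap on reg_covar to avoid an infinite loop.
    reg_covar = _increase_reg(reg_covar)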

self.linkage_ = results[best_idx].linkage
self.reg_covar_ = results[best_idx].reg_covar
self.best_model_ = results[best_idx].model
self.n_iter_ = results[best_idx].model.n_iter_

I think that if we still have self.n_iter_ == self.max_iter at this point, we should raise a ConvergenceWarning explicitly here.
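
For instance, a minimal sketch (ConvergenceWarning lives in sklearn.exceptions):

import warnings
from sklearn.exceptions import ConvergenceWarning

if self.n_iter_ == self.max_iter:
    warnings.warn(
        "The selected model reached max_iter without converging; consider "
        "increasing max_iter or reg_covar.",
        ConvergenceWarning,
    )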

@glemaitre removed their request for review on December 16, 2021
@cmarmo added the Needs Decision - Include Feature label and removed the Waiting for Reviewer and Needs Decision labels on May 9, 2022
Labels: module:mixture, Needs Decision - Include Feature
6 participants