Gaussian Mixture with BIC/AIC · Issue #19338 · scikit-learn/scikit-learn · GitHub

Gaussian Mixture with BIC/AIC #19338


Open · tliu68 opened this issue Feb 3, 2021 · 7 comments · May be fixed by #26735

Comments

@tliu68
Contributor
tliu68 commented Feb 3, 2021

Describe the workflow you want to enable

Clustering with Gaussian mixture modeling frequently entails choosing the best model parameters, such as the number of components and the covariance constraint. This demonstration is very helpful, but I think it would be great to have a class like LassoLarsIC that does the job automatically.

Describe your proposed solution

Add a class (say, GaussianMixtureIC) that automatically selects the best GM model based on BIC or AIC among a set of candidate models; a minimal hand-rolled sketch of the kind of selection it would automate appears after the list below. As mentioned above, the set of models would be parameterized by:

  • Initialization scheme, which could be random, k-means, or agglomerative clustering (as done in mclust, see below)
  • Covariance constraints
  • Number of components
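
For orientation only, here is a minimal runnable sketch of the selection loop such a class would wrap. It is not the proposed implementation: it sweeps n_components and covariance_type with the default k-means initialization and keeps the model with the lowest BIC, and the toy data and parameter ranges are just illustrative assumptions.

# Minimal sketch of the selection a GaussianMixtureIC-like class would automate:
# fit a GaussianMixture for every candidate configuration, keep the lowest BIC.
import itertools

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=1_000, centers=3, random_state=0)

best_bic, best_gmm = float("inf"), None
for n_components, covariance_type in itertools.product(
    range(1, 7), ["spherical", "tied", "diag", "full"]
):
    gmm = GaussianMixture(
        n_components=n_components,
        covariance_type=covariance_type,
        random_state=0,
    ).fit(X)
    bic = gmm.bic(X)
    if bic < best_bic:
        best_bic, best_gmm = bic, gmm

print(best_bic, best_gmm.n_components, best_gmm.covariance_type)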

Additional context

mclust is an R package for GM modeling. The original publication and the most recent version have been cited in 2703 and 956 articles, respectively (Banfield & Raftery, 1993; Scrucca et al., 2016). It incorporates several initialization strategies for the EM algorithm (including agglomerative clustering) and enables automatic model selection via BIC across different combinations of clustering options (Scrucca et al., 2016).

@jovo
jovo commented Feb 8, 2021

@amueller @NicolasHug What do you think?

@NicolasHug
Member

I might be missing something, but I think GridSearchCV + a custom scorer is what you need:

from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, centers=3)

param_grid = {
    'n_components': range(1, 7),
    'covariance_type': ['spherical', 'tied', 'diag', 'full']
}


def bic_scorer(fitted_gmm, X):
    return -fitted_gmm.bic(X)  # return -bic as higher is better for GridSearchCV

gs = GridSearchCV(GaussianMixture(), param_grid=param_grid, scoring=bic_scorer)
gs.fit(X)
gs.best_score_, gs.best_params_

(-15844.66444767653, {'covariance_type': 'spherical', 'n_components': 3})

Unless there's a warm-start "path" that we can follow to optimize the computations (as in LassoCV etc.)?

(Side note, our docs about passing a custom scorer could be a lot better)

@bdpedigo
Contributor
bdpedigo commented Feb 9, 2021

Hi @NicolasHug, I work with @tliu68 on this.

Unless there's a warm-start "path" that we can follow to optimize the computations (as in LassoCV etc.)?

Mclust (and the implementation in Python that we've developed based on it) uses agglomerative clustering as one of the possible initializations to GMM. This is nice because the same initialization is reused when sweeping over covariance types or any other model parameters, and also when varying n_components: given one run of the agglomerative clustering, we can reconstruct a "flat" clustering for multiple values of n_components. So if I understand you, I think this is analogous to the "warm-start path" in LassoLars.
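
To make the "one agglomerative run, many flat clusterings" point concrete, here is a small sketch using SciPy's hierarchy utilities (Ward linkage and the toy data are just illustrative choices; mclust itself uses model-based agglomeration):

# One hierarchical run yields a full tree; cut_tree then recovers a flat
# labeling for every candidate n_components without re-clustering.
from scipy.cluster.hierarchy import cut_tree, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

Z = linkage(X, method="ward")          # single agglomerative clustering run
n_components_range = list(range(1, 7))
flat_labels = cut_tree(Z, n_clusters=n_components_range)
# flat_labels[:, i] holds the labels for n_components_range[i] clusters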

Also, the approach that we were describing doesn't do any cross validation. At some point I figured out how to pass a dummy cv object to GridSearchCV that results in no cross validation, but that felt hacky to me.

More than happy to talk about alternative solutions, though!

@NicolasHug
Member
NicolasHug commented Feb 9, 2021

Mclust (and the implementation in Python that we've developed based on it) uses agglomerative clustering as one of the possible initializations to GMM

If this is just a matter of initialization, could you simply pass weights_init, means_init, etc. to the estimator that gets searched over? Something like

gs = GridSearchCV(GaussianMixture(weights_init=..., means_init=...), param_grid=...)

Also, the approach that we were describing doesn't do any cross validation

So you're evaluating the BIC on the training data? I'm not able to confirm whether that's a good idea or not, but if so, you can definitely use a custom cv like so:

import numpy as np


class TestOnTrainingSetCV:
    """Dummy splitter: a single "split" that trains and evaluates on the full dataset."""

    def split(self, X, y=None, groups=None):
        all_indices = np.arange(X.shape[0])
        yield all_indices, all_indices

    def get_n_splits(self, X=None, y=None, groups=None):
        return 1
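
A minimal usage sketch, assuming the param_grid and bic_scorer from the earlier snippet are in scope:

# Plugging the dummy splitter into GridSearchCV fits and scores every candidate
# on the full training data, i.e. no actual cross-validation takes place.
gs = GridSearchCV(
    GaussianMixture(),
    param_grid=param_grid,
    scoring=bic_scorer,
    cv=TestOnTrainingSetCV(),
)
gs.fit(X)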

@tliu68
Contributor Author
tliu68 commented Feb 9, 2021

Hello @NicolasHug! Thank you for your response!

If this is just a matter of initialization, could you simply pass weights_init, means_init, etc. to the estimator that gets searched over?

Indeed, the agglomerative clustering assignments can be used to compute weights_init, means_init, etc., which one may not know beforehand.
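
For concreteness, a small sketch of that conversion under illustrative assumptions (toy data, a fixed n_components, and only weights and means derived from the labels; precisions are left to the default initialization):

# Sketch: derive weights_init and means_init from an agglomerative labeling,
# then start the GaussianMixture EM iterations from those values.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

n_components = 3
labels = AgglomerativeClustering(n_clusters=n_components).fit_predict(X)

weights_init = np.bincount(labels, minlength=n_components) / len(labels)
means_init = np.vstack([X[labels == k].mean(axis=0) for k in range(n_components)])

gmm = GaussianMixture(
    n_components=n_components,
    weights_init=weights_init,
    means_init=means_init,
).fit(X)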

So you're evaluating the BIC on the training data?

In our proposal, yes, the BIC values are computed for all candidate models fit on the entire training data. But I do see your point about using cv; we will consider that. Again, thank you for your suggestions!

@NicolasHug
Member
NicolasHug commented Feb 9, 2021

Happy to help! If the suggested snippet above suits your needs, perhaps we can close the issue?

@bdpedigo
Contributor
bdpedigo commented Feb 11, 2021

To briefly clarify the mclust algorithm, which is (loosely speaking) what we are proposing to implement here:

  1. Run agglomerative clustering (with different options for linkage, affinity, etc.) to generate a set of initial labelings. The same run of agglomerative clustering can be reused across the various values of n_components.
  2. Fit GaussianMixtures with the various parameter combinations; roughly, this amounts to sweeping over {initializations} x {n_components} x {covariance types}.
  3. Choose the best model based on BIC computed on the whole dataset.

As far as we can tell, the above isn't trivially accomplished with GridSearchCV, for a few reasons (some of which were already mentioned above, but repeated here for clarity):

  • Running agglomerative clustering with multiple different settings, and then extracting the appropriate "flat" clustering for each value of n_components, is not hard but does take a bit of code.
  • Computing the initial parameters from these clusterings also takes a bit of code, since GaussianMixture can currently only be initialized with means, precisions, and weights, not with responsibilities (e.g., per-point cluster labels like those agglomerative clustering gives us).
  • There is no cross-validation involved, so one would have to use the "dummy" cross-validation workaround described above.
  • There are also details of how mclust handles covariance regularization that don't lend themselves easily to a naive grid search.

We are more than happy to talk through the details of how best to implement the above, should it be desired in sklearn. We do think the functionality is (1) useful, given the success of mclust (for instance, mclust had 168k downloads last month, plus the >3600 citations mentioned above), and (2) not currently easy to reproduce in sklearn with existing tools like GridSearchCV, for the reasons above. While it wouldn't be impossible for a user to do it that way, there are enough steps involved (and it would require the user to already be fairly familiar with mclust) that we thought a dedicated class wrapping up all of the above would be convenient and useful for the community.
