Gaussian Mixture with BIC/AIC · Issue #19338 · scikit-learn/scikit-learn · GitHub

Gaussian Mixture with BIC/AIC #19338


Open · tliu68 opened this issue Feb 3, 2021 · 7 comments · May be fixed by #26735

Comments

@tliu68
Contributor
tliu68 commented Feb 3, 2021

Describe the workflow you want to enable

Clustering with Gaussian mixture modeling frequently entails choosing the best model parameters, such as the number of components and the covariance constraint. This demonstration is very helpful, but I think it would be great to have a class like LassoLarsIC that does the job automatically.

Describe your proposed solution

Add a class (say, GaussianMixtureIC) that automatically selects the best GM model based on BIC or AIC among a set of candidate models; a minimal hand-rolled sketch of the kind of selection it would automate appears after the list below. As mentioned above, the set of models would be parameterized by:

  • Initialization scheme, which could be random, k-means, or agglomerative clustering (as done in mclust, see below)
  • Covariance constraints
  • Number of components
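
For orientation only, here is a minimal runnable sketch of the selection loop such a class would wrap. It is not the proposed implementation: it sweeps n_components and covariance_type with the default k-means initialization and keeps the model with the lowest BIC, and the toy data and parameter ranges are just illustrative assumptions.

# Minimal sketch of the selection a GaussianMixtureIC-like class would automate:
# fit a GaussianMixture for every candidate configuration, keep the lowest BIC.
import itertools

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=1_000, centers=3, random_state=0)

best_bic, best_gmm = float("inf"), None
for n_components, covariance_type in itertools.product(
    range(1, 7), ["spherical", "tied", "diag", "full"]
):
    gmm = GaussianMixture(
        n_components=n_components,
        covariance_type=covariance_type,
        random_state=0,
    ).fit(X)
    bic = gmm.bic(X)
    if bic < best_bic:
        best_bic, best_gmm = bic, gmm

print(best_bic, best_gmm.n_components, best_gmm.covariance_type)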

Additional context

mclust is an R package for GM modeling. The original publication and the most recent version have been cited in 2703 and 956 articles, respectively (Banfield & Raftery, 1993; Scrucca et al., 2016). It incorporates several initialization strategies for the EM algorithm (including agglomerative clustering) and enables automatic model selection via BIC across different combinations of clustering options (Scrucca et al., 2016).

@jovo
jovo commented Feb 8, 2021

@amueller @NicolasHug What do you think?

@NicolasHug
Member

I might be missing something, but I think GridSearchCV + a custom scorer is what you need:

from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, centers=3)

param_grid = {
    'n_components': range(1, 7),
    'covariance_type': ['spherical', 'tied', 'diag', 'full']
}


def bic_scorer(fitted_gmm, X):
    return -fitted_gmm.bic(X)  # return -bic as higher is better for GridSearchCV

gs = GridSearchCV(GaussianMixture(), param_grid=param_grid, scoring=bic_scorer)
gs.fit(X)
gs.best_score_, gs.best_params_

(-15844.66444767653, {'covariance_type': 'spherical', 'n_components': 3})

Unless there's a warm-start "path" that we can follow to optimize the computations (as in LassoCV etc.)?

(Side note, our docs about passing a custom scorer could be a lot better)

@bdpedigo
Contributor
bdpedigo commented Feb 9, 2021

Hi @NicolasHug, I work with @tliu68 on this.

Unless there's a warm-start "path" that we can follow to optimize the computations (as in LassoCV etc.)?

Mclust (and the implementation in Python that we've developed based on it) uses agglomerative clustering as one of the possible initializations to GMM. This is nice because the same initialization is reused when sweeping over covariance types or any other model parameters, and also when varying n_components: given one run of the agglomerative clustering, we can reconstruct a "flat" clustering for multiple values of n_components. So if I understand you, I think this is analogous to the "warm-start path" in LassoLars.
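
To make the "one agglomerative run, many flat clusterings" point concrete, here is a small sketch using SciPy's hierarchy utilities (Ward linkage and the toy data are just illustrative choices; mclust itself uses model-based agglomeration):

# One hierarchical run yields a full tree; cut_tree then recovers a flat
# labeling for every candidate n_components without re-clustering.
from scipy.cluster.hierarchy import cut_tree, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

Z = linkage(X, method="ward")          # single agglomerative clustering run
n_components_range = list(range(1, 7))
flat_labels = cut_tree(Z, n_clusters=n_components_range)
# flat_labels[:, i] holds the labels for n_components_range[i] clusters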

Also, the approach that we were describing doesn't do any cross validation. At some point I figured out how to pass a dummy cv object to GridSearchCV that results in no cross validation, but that felt hacky to me.

More than happy to talk about alternative solutions, though!

@NicolasHug
Member
NicolasHug commented Feb 9, 2021

Mclust (and the implementation in Python that we've developed based on it) uses agglomerative clustering as one of the possible initializations to GMM

If this is just a matter of initialization, could you simply pass weights_init, means_init, etc. to the estimator that gets searched over? Something like

gs = GridSearchCV(GaussianMixture(weights_init=..., means_init=...), param_grid=...)

Also, the approach that we were describing doesn't do any cross validation

So you're evaluating the BIC on the training data? I'm not able to confirm whether that's a good idea or not, but if so, you can definitely use a custom cv like so:

import numpy as np


class TestOnTrainingSetCV:
    """Dummy splitter: a single "split" that trains and evaluates on the full dataset."""

    def split(self, X, y=None, groups=None):
        all_indices = np.arange(X.shape[0])
        yield all_indices, all_indices

    def get_n_splits(self, X=None, y=None, groups=None):
        return 1
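
A minimal usage sketch, assuming the param_grid and bic_scorer from the earlier snippet are in scope:

# Plugging the dummy splitter into GridSearchCV fits and scores every candidate
# on the full training data, i.e. no actual cross-validation takes place.
gs = GridSearchCV(
    GaussianMixture(),
    param_grid=param_grid,
    scoring=bic_scorer,
    cv=TestOnTrainingSetCV(),
)
gs.fit(X)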

@tliu68
Contributor Author
tliu68 commented Feb 9, 2021

Hello @NicolasHug! Thank you for your response!

If this is just a matter of initialization, could you simply pass weights_init, means_init, etc. to the estimator that gets searched over?

Indeed, the agglomerative clustering assignments can be used to compute weights_init, means_init, etc., which one may not know beforehand.
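
For concreteness, a small sketch of that conversion under illustrative assumptions (toy data, a fixed n_components, and only weights and means derived from the labels; precisions are left to the default initialization):

# Sketch: derive weights_init and means_init from an agglomerative labeling,
# then start the GaussianMixture EM iterations from those values.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

n_components = 3
labels = AgglomerativeClustering(n_clusters=n_components).fit_predict(X)

weights_init = np.bincount(labels, minlength=n_components) / len(labels)
means_init = np.vstack([X[labels == k].mean(axis=0) for k in range(n_components)])

gmm = GaussianMixture(
    n_components=n_components,
    weights_init=weights_init,
    means_init=means_init,
).fit(X)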

So you're evaluating the BIC on the training data?

In our proposal, yes, the BIC values are computed for all candidate models fit on the entire training data. But I do see your point about using cv; we will consider that. Again, thank you for your suggestions!

@NicolasHug
Member
NicolasHug commented Feb 9, 2021

Happy to help! If the suggested snippet above suits your needs, perhaps we can close the issue?

@bdpedigo
Contributor
bdpedigo commented Feb 11, 2021

To briefly clarify the mclust algorithm, which is (loosely speaking) what we are proposing to implement here:

  1. Run agglomerative clustering (with different options for linkage, affinity, etc.) to generate a set of initial labelings. The same run of agglomerative clustering can be reused across the various values of n_components.
  2. Fit GaussianMixtures with the various parameter combinations; roughly, this amounts to sweeping over {initializations} x {n_components} x {covariance types}.
  3. Choose the best model based on BIC computed on the whole dataset.

As far as we can tell, the above isn't trivially accomplished with GridSearchCV, for a few reasons (some of which were already mentioned above, but repeated here for clarity):

  • Running agglomerative clustering with multiple different settings, and then extracting the appropriate "flat" clustering for each value of n_components, is not hard but does take a bit of code.
  • Computing the initial parameters from these clusterings also takes a bit of code, since GaussianMixture can currently only be initialized with means, precisions, and weights, not with responsibilities (e.g., per-point cluster labels like those agglomerative clustering gives us).
  • There is no cross-validation involved, so one would have to use the "dummy" cross-validation workaround described above.
  • There are also details of how mclust handles covariance regularization that don't lend themselves easily to a naive grid search.

We are more than happy to talk through the details of how best to implement the above, should it be desired in sklearn. We do think the functionality is (1) useful, given the success of mclust (for instance, mclust had 168k downloads last month, plus the >3600 citations mentioned above), and (2) not currently easy to reproduce in sklearn with existing tools like GridSearchCV, for the reasons above. While it wouldn't be impossible for a user to do it that way, there are enough steps involved (and it would require the user to already be fairly familiar with mclust) that we thought a dedicated class wrapping up all of the above would be convenient and useful for the community.
