# Gaussian Mixture with BIC/AIC #19338
@amueller @NicolasHug What do you think?
I might be missing something, but I think a `GridSearchCV` would do the job:

```python
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, centers=3)

param_grid = {
    'n_components': range(1, 7),
    'covariance_type': ['spherical', 'tied', 'diag', 'full'],
}

def bic_scorer(fitted_gmm, X):
    return -fitted_gmm.bic(X)  # return -bic as higher is better for GridSearchCV

gs = GridSearchCV(GaussianMixture(), param_grid=param_grid, scoring=bic_scorer)
gs.fit(X)
gs.best_score_, gs.best_params_
```
Unless there's a warm-start "path" that we can follow to optimize the computations (as in `LassoCV` etc.)? (Side note: our docs about passing a custom scorer could be a lot better.)
Hi @NicolasHug, I work with @tliu68 on some of this work.
Mclust (and the implementation in Python that we've developed based on it) uses agglomerative clustering as one of the possible initializations to GMM. This is nice because the same initialization is used when sweeping over covariance types or any other model parameters, but also for varying the number of components.

Also, the approach that we were describing doesn't do any cross-validation. At some point I figured out how to pass a dummy CV object to `GridSearchCV` so that it doesn't actually hold out any data.

More than happy to talk about alternative solutions, though!
If this is just a matter of initialization, could you simply pass the initialization parameters to the estimator?

```python
gs = GridSearchCV(GaussianMixture(weights_init=..., means_init=...), param_grid=...)
```
So you're evaluating the BIC on the training data? I'm not able to confirm whether this is good or not, but if so, you can definitely use a custom `cv` object that "tests" on the full training set:

```python
import numpy as np

class TestOnTrainingSetCV:
    def split(self, X, y=None, groups=None):
        all_indices = np.arange(X.shape[0])
        yield all_indices, all_indices

    def get_n_splits(self, X=None, y=None, groups=None):
        return 1
```
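Presumably it would be combined with the earlier snippet along these lines (a sketch, reusing the `bic_scorer` and `param_grid` defined above):

```python
gs = GridSearchCV(GaussianMixture(), param_grid=param_grid,
                  scoring=bic_scorer, cv=TestOnTrainingSetCV())
gs.fit(X)
gs.best_params_
```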
Hello @NicolasHug! Thank you for your response!
Indeed, the agglomerative clustering assignments can be used to compute `weights_init`, `means_init`, etc., which one may not know beforehand.
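For example, a minimal sketch of deriving those initial parameters from an agglomerative partition (assuming data `X` and a chosen number of clusters `k`):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

k = 3  # assumed number of components
labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)

weights_init = np.bincount(labels) / len(labels)  # cluster proportions
means_init = np.stack([X[labels == j].mean(axis=0) for j in range(k)])  # cluster centroids
```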
In our proposal, yes, the BIC values are computed for all possible models built on the entire training data. But I do see your point of using a custom `cv` object.
Happy to help! If the suggested snippet above suits your needs, perhaps we can close the issue?
Just to briefly clarify the mclust algorithm, loosely speaking (what we are proposing to implement here):

1. Run agglomerative clustering on the data to obtain initial partitions.
2. For each candidate number of components and each covariance constraint, initialize EM from the corresponding agglomerative partition and fit a GMM.
3. Compute the BIC of each fitted model on the full training data.
4. Return the model with the best BIC.
As far as we can tell, the above isn't trivially accomplished with `GridSearchCV` alone.
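For concreteness, here is a rough sketch of the kind of selection loop we mean; `fit_gmm_ic` is a hypothetical helper, not an existing sklearn API, and the initialization is simplified:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.mixture import GaussianMixture

def fit_gmm_ic(X, n_components_range=range(1, 7),
               covariance_types=('spherical', 'tied', 'diag', 'full')):
    best_bic, best_model = np.inf, None
    for n in n_components_range:
        # one agglomerative partition per n, shared across all covariance types
        labels = AgglomerativeClustering(n_clusters=n).fit_predict(X)
        weights = np.bincount(labels) / len(labels)
        means = np.stack([X[labels == k].mean(axis=0) for k in range(n)])
        for cov in covariance_types:
            gmm = GaussianMixture(n_components=n, covariance_type=cov,
                                  weights_init=weights, means_init=means).fit(X)
            bic = gmm.bic(X)
            if bic < best_bic:
                best_bic, best_model = bic, gmm
    return best_model, best_bic
```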
We are more than happy to talk about details of how to best implement the above, should it be desired in sklearn. We do think that the functionality above is (1) useful, given the success of mclust (for instance, mclust had 168k downloads last month, and the >3600 citations mentioned above), and (2) not currently easy to run in sklearn with the given tools like `GridSearchCV`.
**Describe the workflow you want to enable**

Clustering with Gaussian mixture modeling frequently entails choosing the best model parameters, such as the number of components and the covariance constraint. This demonstration is very helpful to me, but I think it might be great to have a class like `LassoLarsIC` that does the job automatically.
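For comparison, `LassoLarsIC` already offers this kind of criterion-based selection for linear models; a small illustration (the data here is made up):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsIC

X, y = make_regression(n_samples=200, n_features=20, noise=10, random_state=0)

# fits the LARS path once and keeps the alpha that minimizes the criterion
reg = LassoLarsIC(criterion='bic').fit(X, y)
reg.alpha_  # regularization strength selected by BIC
```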
**Describe your proposed solution**

Add a class (say `GaussianMixtureIC`, for example) that automatically selects the best GM model based on BIC or AIC among a set of models. As mentioned above, the set of models would be parameterized by:

- the number of components
- the covariance constraint
- the initialization strategy (as in `mclust`, see below)

**Additional context**
`mclust` is a package in R for GM modeling. The original publication and the most recent version have been cited in 2703 and 956 articles, respectively (Banfield & Raftery, 1993; Scrucca et al., 2016). It incorporates different initialization strategies (including agglomerative clustering) for the EM algorithm and enables automatic model selection via BIC for different combinations of clustering options (Scrucca et al., 2016).