8000 [MRG] GaussianMixture with BIC/AIC by tingshanL · Pull Request #26735 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

Conversation

tingshanL
Copy link

Reference Issues/PRs

Fixes #19338. Automates the selection in Gaussian Mixture Model Selection. Adds a basic GaussianMixtureIC estimator without initializing with agglomerative clustering as discussed in #19562.

What does this implement/fix? Explain your changes.

It automatically selects the best GM model based on BIC or AIC among a set of models that are parameterized by:

  • Covariance constraints

  • Number of components

Any other comments?

@github-actions
Copy link
github-actions bot commented Jun 30, 2023

❌ Linting issues

This PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling pre-commit hooks. Instructions to enable them can be found here.

You can see the details of the linting issues under the lint job here


ruff check

ruff detected issues. Please run ruff check --fix --output-format=full locally, fix the remaining issues, and push the changes. Here you can see the detected issues. Note that the installed ruff version is ruff=0.11.7.


sklearn/mixture/_gaussian_mixture_ic.py:1:1: CPY001 Missing copyright notice at top of file
sklearn/mixture/_gaussian_mixture_ic.py:10:1: TID252 Prefer absolute imports over relative imports
   |
 8 | import numpy as np
 9 |
10 | from ..base import BaseEstimator, ClusterMixin
   | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ TID252
11 | from ..model_selection import GridSearchCV
12 | from ..utils._param_validation import (
   |
   = help: Replace relative imports with absolute imports

sklearn/mixture/_gaussian_mixture_ic.py:10:1: TID252 Prefer absolute imports over relative imports
   |
 8 | import numpy as np
 9 |
10 | from ..base import BaseEstimator, ClusterMixin
   | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ TID252
11 | from ..model_selection import GridSearchCV
12 | from ..utils._param_validation import (
   |
   = help: Replace relative imports with absolute imports

sklearn/mixture/_gaussian_mixture_ic.py:11:1: TID252 Prefer absolute imports over relative imports
   |
10 | from ..base import BaseEstimator, ClusterMixin
11 | from ..model_selection import GridSearchCV
   | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ TID252
12 | from ..utils._param_validation import (
13 |     Integral,
   |
   = help: Replace relative imports with absolute imports

sklearn/mixture/_gaussian_mixture_ic.py:12:1: TID252 Prefer absolute imports over relative imports
   |
10 |   from ..base import BaseEstimator, ClusterMixin
11 |   from ..model_selection import GridSearchCV
12 | / from ..utils._param_validation import (
13 | |     Integral,
14 | |     Interval,
15 | |     InvalidParameterError,
16 | |     StrOptions,
17 | | )
   | |_^ TID252
18 |   from ..utils.validation import check_is_fitted
19 |   from . import GaussianMixture
   |
   = help: Replace relative imports with absolute imports

sklearn/mixture/_gaussian_mixture_ic.py:12:1: TID252 Prefer absolute imports over relative imports
   |
10 |   from ..base import BaseEstimator, ClusterMixin
11 |   from ..model_selection import GridSearchCV
12 | / from ..utils._param_validation import (
13 | |     Integral,
14 | |     Interval,
15 | |     InvalidParameterError,
16 | |     StrOptions,
17 | | )
   | |_^ TID252
18 |   from ..utils.validation import check_is_fitted
19 |   from . import GaussianMixture
   |
   = help: Replace relative imports with absolute imports

sklearn/mixture/_gaussian_mixture_ic.py:12:1: TID252 Prefer absolute imports over relative imports
   |
10 |   from ..base import BaseEstimator, ClusterMixin
11 |   from ..model_selection import GridSearchCV
12 | / from ..utils._param_validation import (
13 | |     Integral,
14 | |     Interval,
15 | |     InvalidParameterError,
16 | |     StrOptions,
17 | | )
   | |_^ TID252
18 |   from ..utils.validation import check_is_fitted
19 |   from . import GaussianMixture
   |
   = help: Replace relative imports with absolute imports

sklearn/mixture/_gaussian_mixture_ic.py:12:1: TID252 Prefer absolute imports over relative imports
   |
10 |   from ..base import BaseEstimator, ClusterMixin
11 |   from ..model_selection import GridSearchCV
12 | / from ..utils._param_validation import (
13 | |     Integral,
14 | |     Interval,
15 | |     InvalidParameterError,
16 | |     StrOptions,
17 | | )
   | |_^ TID252
18 |   from ..utils.validation import check_is_fitted
19 |   from . import GaussianMixture
   |
   = help: Replace relative imports with absolute imports

sklearn/mixture/_gaussian_mixture_ic.py:18:1: TID252 Prefer absolute imports over relative imports
   |
16 |     StrOptions,
17 | )
18 | from ..utils.validation import check_is_fitted
   | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ TID252
19 | from . import GaussianMixture
   |
   = help: Replace relative imports with absolute imports

sklearn/mixture/_gaussian_mixture_ic.py:19:1: TID252 Prefer absolute imports over relative imports
   |
17 | )
18 | from ..utils.validation import check_is_fitted
19 | from . import GaussianMixture
   | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ TID252
   |
   = help: Replace relative imports with absolute imports

Found 10 errors.
No fixes available (9 hidden fixes can be enabled with the `--unsafe-fixes` option).

ruff format

ruff detected issues. Please run ruff format locally and push the changes. Here you can see the detected issues. Note that the installed ruff version is ruff=0.11.7.


--- sklearn/mixture/_gaussian_mixture_ic.py
+++ sklearn/mixture/_gaussian_mixture_ic.py
@@ -4,7 +4,6 @@
 #          Thomas Athey <tathey1@jhmi.edu>
 #          Benjamin Pedigo <bpedigo@jhu.edu>
 
-
 import numpy as np
 
 from ..base import BaseEstimator, ClusterMixin

1 file would be reformatted, 927 files already formatted

Generated for commit: a08d428. Link to the linter CI: here

@tingshanL tingshanL marked this pull request as ready for review July 25, 2023 21:18
@tingshanL tingshanL changed the title GaussianMixture with BIC/AIC [MRG] GaussianMixture with BIC/AIC Jul 12, 2024
@jovo
Copy link
jovo commented Jul 23, 2024

@amueller @NicolasHug What do you think?

@jovo
Copy link
jovo commented Jul 30, 2024

@jjerphan @ogrisel This PR is designed to greatly simplify user's life when trying to do something like in this tutorial: https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_selection.html, and is highly simplified relative to other recently proposed PRs, such as #19562. We are grateful for your feedback, and look forward to finalizing this.

@jovo
Copy link
jovo commented Aug 13, 2024

@adam2392 Hey Adam - Curious to hear what you think about this?

@tingshanL
Copy link
Author

@adam2392 Hey Adam - Curious to hear what you think about this?

Hello Adam @adam2392,

We have reviewed and addressed all feedback provided here. Below is a summary of the primary comments and our responses:

  • "significantly richer init capabilities than the underlying GaussianMixture class in scikit-learn": currently, GaussianMixtureIC selects models only based on covariance constraints and the number of components, the same as the GM model selection example in the library.
  • "orders of magnitude slower than GaussianMixture": using current GaussianMixtureIC to run the same task as in the example does not take more time. On my computer, both fitting procedures took around 1.3s.
  • "to configure continuous integration to run the tests, including a test to run the check_estimator function to make sure that the code stays compatible with future scikit-learn versions": all checks, including the tests related to check_estimator, have passed.
  • "joblib-based multi-threading": it is no longer in the code.
  • "regularization": also not included in the code anymore.

With these revisions, we believe the code is ready for the next review. Please let us know if there is anything further we should adjust. Thank you very much for your time and insights!

@adam2392
Copy link
Member

Hey @tingshanL okay thanks!

I think this will require a discussion among some of the more senior maintainers on the team. Personally, I do see the use of a mclust like algorithm within Python/sklearn ecosystem, since I have used it in the past in R. However, we'll have to see what others say… I know they're busy, and including new models is not super easy, so thank you for the patience.

Out of curiosity (perhaps you can just summarize in the PR description if we want to reserve the space for other discussion), how does this differ from mclust? If there are significant differences, what would need to be done to make this 1-1 matching? This is just some high level info I'm curious on to guide the discussion, so feel free to not spend too much time addressing this question.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Gaussian Mixture with BIC/AIC

5 participants

0