ENH Adds n_feature_in_ checking to cluster #18727


Merged
11 commits merged on Nov 20, 2020

Conversation

thomasjpfan (Member)

Continues #18514

ogrisel (Member) left a comment

LGTM, just a nitpick.

thomasjpfan (Member, Author)

While finishing these up, I am growing more concerned about validating twice. Should all functions that call check_array have a check_input flag, so the caller can say "I have already checked the input"?
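For illustration only, a minimal sketch of what such an opt-out flag could look like on predict; the check_input parameter and the ToyClusterer estimator below are hypothetical, not existing scikit-learn API:

import numpy as np
from sklearn.utils.validation import check_array, check_is_fitted

class ToyClusterer:
    """Made-up estimator sketching a hypothetical check_input opt-out."""

    def fit(self, X):
        X = check_array(X)
        self.n_features_in_ = X.shape[1]
        self.cluster_centers_ = X[:3].copy()  # placeholder "centers"
        return self

    def predict(self, X, check_input=True):
        check_is_fitted(self)
        if check_input:
            # Full validation (dtype, 2-D shape, finiteness, ...); a caller that
            # has already validated X could pass check_input=False to skip it.
            X = check_array(X)
        # ... assign each sample to its nearest row of self.cluster_centers_ ...
        return np.zeros(np.asarray(X).shape[0], dtype=int)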

thomasjpfan (Member, Author)

Or we set assume_finite=True via the context manager for now, to avoid the NaN check.

ogrisel (Member) commented Nov 2, 2020

While finishing these up, I am growing more concerned about validating twice. Should all functions that call check_array have a check_input flag, so the caller can say "I have already checked the input"?

I am not sure how that would work. Sometimes only the caller knows that the check has already been done. We would need to dig a hole in the API of predict / transform and similar methods to pass this flag, unless we use a context manager as discussed here: #18691 (comment)

Edit:

Or we set assume_finite=True via the context manager for now, to avoid the NaN check.

I saw this reply to your comment after writing my own...

Comment on lines +456 to +457
with config_context(assume_finite=True):
return pairwise_distances_argmin(X, self.cluster_centers_)
thomasjpfan (Member, Author)

As a quick benchmark:

from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_classification

X, _ = make_classification(n_features=10_000, n_samples=5_000, random_state=42)
aff_prop = AffinityPropagation(random_state=42)
aff_prop.fit(X)

# this PR
%timeit aff_prop.predict(X)
# 182 ms ± 2.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# master
%timeit aff_prop.predict(X)
# 254 ms ± 5.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
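
The saving comes from the assume_finite=True context: inside it, check_array (and hence the validation done for pairwise_distances_argmin) skips the scan for NaN/inf that predict would otherwise repeat. A minimal illustration, not part of the PR:

import numpy as np
from sklearn import config_context
from sklearn.utils import check_array

X = np.random.rand(5_000, 10_000)

# Default behaviour: check_array scans every value for NaN/inf.
check_array(X)

# Inside the context the finiteness scan is skipped, so re-validation is cheap.
with config_context(assume_finite=True):
    check_array(X)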

ogrisel (Member) commented Nov 2, 2020

The failure of the recently merged test_poisson_zero_nodes is unrelated. Probably a missing rng seed. Will do a PR.

@@ -38,10 +37,7 @@ def transform(self, X):
"""
check_is_fitted(self)

X = check_array(X)
if len(self.labels_) != X.shape[1]:
Reviewer (Member)

We're assuming that the invariant len(self.labels_) == X_train.shape[1] is enforced elsewhere, right? I guess this was only a workaround for not having n_features_in_ anyway.
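
For reference, a minimal sketch (made-up estimator, not code from this PR) of how _validate_data with reset=False takes over that check: fit records n_features_in_, and later calls raise if the feature count differs:

import numpy as np
from sklearn.base import BaseEstimator

class ToyTransformer(BaseEstimator):
    def fit(self, X, y=None):
        # reset=True (the default) stores X.shape[1] as self.n_features_in_.
        X = self._validate_data(X)
        return self

    def transform(self, X):
        # reset=False re-validates X and raises a ValueError if
        # X.shape[1] != self.n_features_in_, replacing the manual labels_ check.
        return self._validate_data(X, reset=False)

est = ToyTransformer().fit(np.ones((5, 3)))
est.transform(np.ones((5, 4)))  # raises ValueError: mismatched number of features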

f"Incorrect number of features. Got {n_features} features, "
f"expected {expected_n_features}.")

X = self._validate_data(X, accept_sparse='csr', reset=False,
Reviewer (Member)

Do we still need the function now? But fine with me.
