ENH Adds n_feature_in_ checking to cluster #18727


Merged
11 commits merged on Nov 20, 2020

Conversation

thomasjpfan (Member)

Continues #18514

ogrisel (Member) left a comment

LGTM, just a nitpick.

thomasjpfan (Member, Author)

While finishing these up, I am growing more concerned about validating twice. Should all functions that call check_array have a check_input flag, so the caller can say "I have already checked the input"?
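For illustration only, a minimal sketch of what such an opt-out flag could look like on predict; the check_input parameter and the ToyClusterer estimator below are hypothetical, not existing scikit-learn API:

import numpy as np
from sklearn.utils.validation import check_array, check_is_fitted

class ToyClusterer:
    """Made-up estimator sketching a hypothetical check_input opt-out."""

    def fit(self, X):
        X = check_array(X)
        self.n_features_in_ = X.shape[1]
        self.cluster_centers_ = X[:3].copy()  # placeholder "centers"
        return self

    def predict(self, X, check_input=True):
        check_is_fitted(self)
        if check_input:
            # Full validation (dtype, 2-D shape, finiteness, ...); a caller that
            # has already validated X could pass check_input=False to skip it.
            X = check_array(X)
        # ... assign each sample to its nearest row of self.cluster_centers_ ...
        return np.zeros(np.asarray(X).shape[0], dtype=int)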

thomasjpfan (Member, Author)

Or we set assume_finite=True via the context manager for now, to avoid the NaN check.

ogrisel (Member) commented Nov 2, 2020

While finishing these up, I am growing more concerned about validating twice. Should all functions that call check_array have a check_input flag, so the caller can say "I have already checked the input"?

I am not sure how that would work. Sometimes only the caller knows that the check has already been done. We would need to dig a hole in the API of predict / transform and similar methods to pass this flag, unless we use a context manager as discussed here: #18691 (comment)

Edit:

Or we set assume_finite=True via the context manager for now, to avoid the NaN check.

I saw this reply to your comment after writing my own...

Comment on lines +456 to +457
with config_context(assume_finite=True):
return pairwise_distances_argmin(X, self.cluster_centers_)
thomasjpfan (Member, Author)

As a quick benchmark:

from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_classification

X, _ = make_classification(n_features=10_000, n_samples=5_000, random_state=42)
aff_prop = AffinityPropagation(random_state=42)
aff_prop.fit(X)

# this PR
%timeit aff_prop.predict(X)
# 182 ms ± 2.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# master
%timeit aff_prop.predict(X)
# 254 ms ± 5.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
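
The saving comes from the assume_finite=True context: inside it, check_array (and hence the validation done for pairwise_distances_argmin) skips the scan for NaN/inf that predict would otherwise repeat. A minimal illustration, not part of the PR:

import numpy as np
from sklearn import config_context
from sklearn.utils import check_array

X = np.random.rand(5_000, 10_000)

# Default behaviour: check_array scans every value for NaN/inf.
check_array(X)

# Inside the context the finiteness scan is skipped, so re-validation is cheap.
with config_context(assume_finite=True):
    check_array(X)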

ogrisel (Member) commented Nov 2, 2020

The failure of the recently merged test_poisson_zero_nodes is unrelated. Probably a missing rng seed. Will do a PR.

@@ -38,10 +37,7 @@ def transform(self, X):
"""
check_is_fitted(self)

X = check_array(X)
if len(self.labels_) != X.shape[1]:
Reviewer (Member)

We're assuming that the invariant len(self.labels_) == X_train.shape[1] is enforced elsewhere, right? I guess this was only a workaround for not having n_features_in_ anyway.
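
For reference, a minimal sketch (made-up estimator, not code from this PR) of how _validate_data with reset=False takes over that check: fit records n_features_in_, and later calls raise if the feature count differs:

import numpy as np
from sklearn.base import BaseEstimator

class ToyTransformer(BaseEstimator):
    def fit(self, X, y=None):
        # reset=True (the default) stores X.shape[1] as self.n_features_in_.
        X = self._validate_data(X)
        return self

    def transform(self, X):
        # reset=False re-validates X and raises a ValueError if
        # X.shape[1] != self.n_features_in_, replacing the manual labels_ check.
        return self._validate_data(X, reset=False)

est = ToyTransformer().fit(np.ones((5, 3)))
est.transform(np.ones((5, 4)))  # raises ValueError: mismatched number of features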

f"Incorrect number of features. Got {n_features} features, "
f"expected {expected_n_features}.")

X = self._validate_data(X, accept_sparse='csr', reset=False,
Reviewer (Member)

Do we still need the function now? But fine with me.
