8000 Partial centroid by Robinspecteur · Pull Request #13033 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

Partial centroid #13033

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Closed

Conversation

Robinspecteur
Copy link

Reference Issues/PRs

#12952

What does this implement/fix? Explain your changes.

Adds a partial_fit method for the NearestCentroid

Any other comments?

@adrinjalali
Copy link
Member

tests are failing

Copy link
Member
@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

self.centroids_ = np.empty((n_classes, n_features), dtype=np.float64)
# Number of clusters in each class.
nk = np.zeros(n_classes)
old_nk = deepcopy(self.nk_)
Copy link 8000
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use self.nk_.copy() rather than deepcopy.

# Number of clusters in each class.
nk = np.zeros(n_classes)
old_nk = deepcopy(self.nk_)
old_centroids = deepcopy(self.true_centroids_)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

@@ -110,53 +169,90 @@ def fit(self, X, y):
check_classification_targets(y)

n_samples, n_features = X.shape

if _refit or _check_partial_fit_first_call(self, classes):
self.true_classes_ = classes = np.asarray(classes)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please document the new attributes

y2 = [-1, -1, 1, -1, 1, 1]
clf = NearestCentroid()
clf.partial_fit(X2[:3], y2[:3], classes=[-1, 1])
assert_array_equal(clf.predict(T), [-1, 1, 1])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this is identical to true_result, so what have we shown?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we at least check that the centroids differ from the complete fit?

clf = NearestCentroid()
clf.partial_fit(X[:3], y[:3], classes=[-1, 1])
clf.partial_fit(X[3:], y[3:])
assert_array_equal(clf.predict(T), true_result)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also check that the centroids match using fit on the full dataset?

# Ensure that the shrinking is correct.
# The expected result is calculated by R (pamr),
# which is implemented by the author of the original paper.
# (One need to modify the code to output the new centroid in pamr.predict)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modify which code? I don't understand this. You could just show the R code used to derive these numbers.

@@ -219,19 +226,33 @@ def _partial_fit(self, X, y, classes=None, _refit=False):
self.centroids_ = self.true_centroids_[self.nk_ != 0]

if self.shrink_threshold:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend moving the two "deepcopy" calls on lines 192, 193 into the "if self.shrink_threshold" conditional on line 228. This way you only compute them when needed.

@@ -177,6 +177,10 @@ def _partial_fit(self, X, y, classes=None, _refit=False):
# Number of clusters in each class.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a user, I find the warning on lines 214-217 to be quite annoying. This code was present in the original implementation, but it might make sense to move it to the "init" instead so the user only gets the warning once. Or we could move it to the "if _refit or _check_partial_fit_first_call(self, classes)" conditional on line 173. Previously, we only received the warning when calling "fit", which isn't too often. But "partial_fit" is meant to be called often on small batches of data, and this warning gets printed every time, which is redundant and tiresome.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think having it only appear on the first partial_fit is reasonable.

@amueller
Copy link
Member
amueller commented Aug 6, 2019

@Robinspecteur are you still working on this?

@dirkpadfield
Copy link
dirkpadfield commented Aug 6, 2019 via email

@amueller
Copy link
Member
amueller commented Aug 6, 2019

@dirkpadfield good to know. The question is mainly if there's a point for maintainers to spend more time reviewing at this point, and/or if we should open this up for someone else to complete. There's some failing tests and your comments haven't been addressed yet :)

Base automatically changed from master to main January 22, 2021 10:50
@cmarmo cmarmo added Superseded PR has been replace by a newer PR and removed Stalled labels Jul 9, 2022
@cmarmo
Copy link
Contributor
cmarmo commented Jul 9, 2022

Closing as superseded by #19262.

@cmarmo cmarmo closed this Jul 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module:neighbors Needs work Superseded PR has been replace by a newer PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

< 3B1C div class="participation">
6 participants
0