BUG: clamping is implemented incorrectly in sklearn.semi_supervised.LabelSpreading · Issue #5774 · scikit-learn/scikit-learn · GitHub

BUG: clamping is implemented incorrectly in sklearn.semi_supervised.LabelSpreading #5774


Closed
musically-ut opened this issue Nov 9, 2015 · 8 comments

@musically-ut
Contributor

The code which does clamping in sklearn.semi_supervised.LabelSpreading appears to be incorrect:

    clamp_weights = np.ones((n_samples, 1))
    clamp_weights[unlabeled, 0] = self.alpha

    # ...

    y_static = np.copy(self.label_distributions_)
    if self.alpha > 0.:
        y_static *= 1 - self.alpha
    y_static[unlabeled] = 0

    # ...

    while ...:
        ...
        self.label_distributions_ = safe_sparse_dot(
            graph_matrix, self.label_distributions_)
        # clamp
        self.label_distributions_ = np.multiply(
            clamp_weights, self.label_distributions_) + y_static

This does the following (writing M for graph_matrix, and y_old / y_init for the current and initial label distributions):

  1. If ith sample is labeled, then: y_new[i] = 1.0 * M * y_old[i] + (1 - alpha) * y_init[i]
  2. If ith sample is unlabeled, then: y_new[i] = alpha * M * y_old[i] + 0.0

This is clearly incorrect. The correct way to do this is the reverse (see the sketch after this list):

  1. If ith sample is labeled, then: y_new[i] = alpha * M * y_old[i] + (1 - alpha) * y_init[i]
  2. If ith sample is unlabeled, then: y_new[i] = 1.0 * M * y_old[i] + 0.0
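To make the two update rules concrete, here is a minimal self-contained NumPy sketch of a single iteration; M, y_init, and the sample split are hypothetical stand-ins for the sklearn internals, not the library's actual code:

    import numpy as np

    # Hypothetical 3-sample problem: samples 0 and 1 are labeled, sample 2 is not.
    M = np.array([[0.8, 0.1, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.2, 0.6]])   # stand-in for the normalized graph_matrix
    y_init = np.array([[1., 0.],
                       [0., 1.],
                       [0., 0.]])
    unlabeled = np.array([False, False, True])
    alpha = 0.2

    y_static = (1 - alpha) * y_init   # static term; unlabeled rows are already zero

    def step(y, clamp_weights):
        return clamp_weights * (M @ y) + y_static

    w_buggy = np.ones((3, 1))
    w_buggy[unlabeled, 0] = alpha     # what the current code does
    w_fixed = np.ones((3, 1))
    w_fixed[~unlabeled, 0] = alpha    # the proposed correction

With w_buggy, the labeled rows keep the full, undamped M @ y term on top of the static term, so the iteration need not converge; with w_fixed, the labeled rows are pulled toward y_init at every step.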

The fix is relatively simple:

-clamp_weights[unlabeled, 0] = self.alpha
+clamp_weights[~unlabeled, 0] = self.alpha

I can create a PR for this but am not sure what kind of test cases I should add to avoid a regression, if any.


Test case:

from sklearn.semi_supervised import label_propagation

samples = [[1., 0.], [0., 1.], [1., 2.5]]
labels = [0, 1, -1]
mdl = label_propagation.LabelSpreading(kernel='rbf', max_iter=5000)
mdl.fit(samples, labels)  # This will use up all 5000 iterations without converging

With the fix in place, it takes only 6 iterations.
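
One way to turn this into a regression test is to assert on the iteration count. A sketch, assuming the fitted estimator exposes an n_iter_ attribute (recent scikit-learn versions set it after fit; older ones may not):

    from sklearn.semi_supervised import label_propagation

    samples = [[1., 0.], [0., 1.], [1., 2.5]]
    labels = [0, 1, -1]
    mdl = label_propagation.LabelSpreading(kernel='rbf', max_iter=5000)
    mdl.fit(samples, labels)
    # Before the fix this exhausts max_iter; with it, convergence is near-immediate.
    assert mdl.n_iter_ < mdl.max_iter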

@musically-ut
Contributor Author

I just noticed that the definition of self.alpha in the documentation does not align with the implementation either. For labelled inputs, clamp_weights[i] should be 1 - self.alpha and y_static[i] should be self.alpha * y_init[i], not self.alpha and (1 - self.alpha) * y_init[i], respectively.

For example, if we set alpha = 1.0, which means that the labels for the labelled data must not change:

  1. Uncorrected version: For labelled input i: y_new[i] = 1 * M * y_old[i] + 0 * y_init[i]
  2. Corrected version: For labelled input i: y_new[i] = 0 * M * y_old[i] + 1 * y_init[i]

It is easy to see that only the corrected option will actually clamp the old labels.
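
A quick numeric check of the alpha = 1.0 case, using a hypothetical two-sample setup rather than the actual sklearn internals:

    import numpy as np

    M = np.array([[0.7, 0.3],
                  [0.3, 0.7]])        # stand-in for the normalized graph matrix
    y_init = np.array([[1., 0.],      # sample 0 is labeled
                       [0., 0.]])     # sample 1 is unlabeled
    labeled = np.array([True, False])
    alpha = 1.0                       # "never change the given labels"

    y = np.array([[0.4, 0.6],         # arbitrary current estimate
                  [0.5, 0.5]])
    y_new = M @ y
    # Corrected rule for labeled rows: (1 - alpha) * (M @ y)[i] + alpha * y_init[i]
    y_new[labeled] = (1 - alpha) * y_new[labeled] + alpha * y_init[labeled]
    assert np.allclose(y_new[labeled], y_init[labeled])  # row 0 is exactly clamped

Under the uncorrected rule, the labeled row would instead be 1 * (M @ y)[0] + 0 * y_init[0], which depends on y and is not clamped at all.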


Someone else should independently verify the mathematics and the documentation here as well.

@amueller
Member

Indeed. Please have a look at #3550, too. I am a bit too busy at the moment to follow up on this.

@musically-ut
Contributor Author

Just a small update to help people who may wander onto this issue: I am currently maintaining a version of the semi-supervised learning algorithm forked from sklearn here: musically-ut/semi_supervised where, I think, I have fixed the bugs.

I intend to create a pull request if and when I have a good enough test-case to prove things one way or the other.

@amueller
Member
amueller commented Feb 9, 2016

@musically-ut it would be great if you could provide a PR.

@boechat107

@musically-ut I have done some tests with your code and it seems to be correctly clamping the labels. Do you have plans to submit a PR soon?

@musically-ut
Contributor Author

@boechat107 Great to know and good to have some independent verification. :)

Just to be clear, I am assuming that you tested musically-ut/semi_supervised?

The primary problem before I make a PR is coming up with good tests. The existing tests pass with the current code, but it is not clear that they are themselves correct (and, indeed, my code produces different results on some of them).

I spent a while trying to come up with test cases by hand-solving the problems, but couldn't find examples that are obviously correct and cover all the corner cases.

Beyond that, there are a few problems with the underlying theory of the implementation which I have fixed in my fork. E.g. the PhD thesis referenced considers only undirected k-NN graphs, while the implementation uses the directed version. I suspect that this was only an oversight. However, this is a major change and I will not be comfortable including it in sklearn without another expert looking at it.
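
For context, the asymmetry is easy to observe with sklearn's own k-NN graph builder, since i being among j's nearest neighbors does not imply the converse. A common way to recover the undirected graph assumed by the theory is to symmetrize the adjacency matrix; this is a sketch of the idea, not the code from either repository:

    import numpy as np
    from sklearn.neighbors import kneighbors_graph

    X = np.random.RandomState(0).rand(20, 2)
    W = kneighbors_graph(X, n_neighbors=3, mode='connectivity')  # W[i, j] = 1 iff j is a neighbor of i
    print((W != W.T).nnz)       # typically nonzero: the directed graph is asymmetric
    W_sym = 0.5 * (W + W.T)     # symmetrized, undirected version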

@boechat107

@musically-ut Yes, I used the code from the repo you maintain.

I didn't notice this detail about the graph and I think you are correct. Your implementation seems to fix this too, assuming it is another bug.

But what do you think about fixing one issue at a time? It seems clear that you fixed the label clamping (I'm going to check LabelSpreading's theory too), although I agree that better test cases are important. So a PR to fix the label clamping and the documentation of alpha could already be proposed, couldn't it? Then a specific issue could be opened for the graph, although I believe your implementation fixes that too.

I think I'll have some more days to work with this and I could help with something, like some test cases. @amueller, would you like to give us some comments? The PR #3758 looks inactive...

@jnothman
Member
jnothman commented Jul 5, 2017

Fixed in #9239

@jnothman closed this as completed Jul 5, 2017