[BUG] LabelPropagation should use undirected graphs for knn kernel. #8008

Open · musically-ut opened this issue Dec 7, 2016 · 10 comments
@musically-ut (Contributor)

This is with reference to PR #6727 and is discussed in the comments on #5774.

Description

The current implementation of LabelPropagation in the semi_supervised module creates directed k-NN graphs when the knn kernel is chosen.

One way to fix it is as shown in musically-ut/semi_supervised.
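For illustration, here is a minimal sketch of one such symmetrization, taking the union of the directed edges (the construction and names are mine, not necessarily what the linked fix does):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

X = np.random.RandomState(0).rand(20, 2)
A = kneighbors_graph(X, n_neighbors=3, mode="connectivity")

# The raw kNN graph is directed: an edge i -> j does not imply j -> i.
print("asymmetric entries:", (A - A.T).nnz)

# Symmetrize by keeping an edge if it exists in either direction.
A_sym = A.maximum(A.T)
print("after symmetrizing:", (A_sym - A_sym.T).nnz)  # 0
```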

Steps to Reproduce / Expected Results / Actual Results

This is part of the reason I didn't submit it as a separate issue: it is difficult to come up with test cases that are obviously correct or incorrect.

However, none of the papers on semi-supervised learning cited in #6727 considers directed graphs. Using directed graphs has some obvious problems, such as rendering parts of the graph completely unreachable and thus restricting labels to propagate only from roots to their descendants.

@jnothman (Member) commented Dec 8, 2016

So is the issue of non-symmetry restricted to ties? I wonder if we should have a more consistent method for handling ties in KNN.

@musically-ut (Contributor, Author) commented Dec 8, 2016

@jnothman No, not really. Consider the following example (k = 2):

[Figure: an asymmetric kNN graph with a lone point]

Here the kNN graph is asymmetric because the lone point does not have any edges pointing to it.

Was that your confusion? Does this issue make sense now?

@amueller (Member) commented Dec 8, 2016

nice drawing :) [or you can think of points on a line at np.arange(n)**2; then the nearest neighbor is always the point to the left]
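A quick check of that construction (a sketch of my own, using kneighbors_graph):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

# Points at 0, 1, 4, 9, ...: for i >= 1 the gap to the left neighbour
# (2i - 1) is smaller than the gap to the right (2i + 1), so every 1-NN
# edge except the first one points to the left.
X = (np.arange(8) ** 2).reshape(-1, 1)
A = kneighbors_graph(X, n_neighbors=1, mode="connectivity")

print(A.toarray().astype(int))
# The graph is directed: the last point has an outgoing edge but no
# incoming one, so the matrix cannot be symmetric.
print("symmetric:", (A - A.T).nnz == 0)  # False
```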

@musically-ut (Contributor, Author)

So any ideas about how to fix the problem, apart from the crude solution I have implemented?

I suspect that the call to _get_kernel with y != None (which is used here) will also need the same fix.

@jnothman (Member) commented Dec 9, 2016 via email

musically-ut added a commit to musically-ut/scikit-learn that referenced this issue Jul 23, 2017
Also, fix/update the tests.
Fixes scikit-learn#8008.
musically-ut added a commit to musically-ut/scikit-learn that referenced this issue Jul 28, 2017
Also, fix/update the tests.
Fixes scikit-learn#8008.
@jnothman (Member) commented Jul 29, 2017

Could you please summarise here what possible (ideally not hacky) options we have for undirected KNN kernels? Thanks a lot.

@piccolbo commented Sep 1, 2017

A kNN graph is a directed graph and non-symmetric in general. If we need symmetry, #6727 is a very reasonable solution, but it is neither unique nor natural: it effectively implements a new kernel. I don't think it's a matter of being hacky or not.

@piccolbo commented Sep 1, 2017

Another comment is that symmetry does not imply reachability. Imagine two well-separated clusters, each with more than k points, with the available labels all in one of them and a symmetrized kNN kernel. The label-free cluster should remain unlabeled, AFAIK.
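A quick sketch of that scenario (the construction is mine; connected_components is from SciPy):

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import kneighbors_graph

# Two well-separated blobs, each much larger than k.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2), rng.randn(20, 2) + 100.0])

A = kneighbors_graph(X, n_neighbors=2, mode="connectivity")
A_sym = A.maximum(A.T)  # symmetrized kNN graph

# Still two components: labels placed in one blob can never reach the other.
n_components, _ = connected_components(A_sym, directed=False)
print(n_components)  # 2
```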

@musically-ut (Contributor, Author)

Yes, the make-the-kNN-matrix-symmetric solution is a bit of a hack. It doesn't guarantee connectedness and doesn't have any theoretical justification (as far as I have been able to find).

Here is a paper that takes a more methodical approach: http://ieeexplore.ieee.org/abstract/document/4358958/?part=1

Wang, Fei, and Changshui Zhang. "Label propagation through linear neighborhoods." IEEE Transactions on Knowledge and Data Engineering 20.1 (2008): 55-67.

The underlying idea of the paper is to first create a k-NN graph and then infer the weight of each edge by minimizing the linear-reconstruction error:

sum( (x[i] - sum(w[i][j] * x[j] for j in kNN(i, G)))**2 for i in Vertex(G) )

such that sum(w[i][j] for j in kNN(i, G)) == 1 (with w[i][j] >= 0) for all i. After that, it uses a LabelPropagation-like approach to propagate the labels. I'm not sure how to solve the constrained optimization problem efficiently to create the kernel.
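If the w[i][j] >= 0 constraint is dropped, the equality-constrained problem has a closed form per point, exactly as in LLE's weight step. A rough sketch under that simplification (all names are mine; the paper instead solves a small QP per point to keep the weights nonnegative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lnp_weights(X, k=5, reg=1e-3):
    """Reconstruction weights: min ||x_i - sum_j w_ij x_j||^2, sum_j w_ij = 1."""
    n = X.shape[0]
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = idx[i, 1:]                    # drop the point itself
        Z = X[nbrs] - X[i]                   # neighbours centred on x_i
        G = Z @ Z.T                          # local Gram matrix
        G += reg * np.trace(G) * np.eye(k)   # regularise for numerical stability
        w = np.linalg.solve(G, np.ones(k))
        W[i, nbrs] = w / w.sum()             # enforce the sum-to-one constraint
    return W
```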

More generally, for semi-supervised learning on directed graphs, this is one of the key references: https://www.microsoft.com/en-us/research/wp-content/uploads/2017/01/SDG.pdf

Zhou, Denny, Thomas Hofmann, and Bernhard Schölkopf. "Semi-supervised learning on directed graphs." Advances in neural information processing systems. 2005.

Their algorithm uses one extra parameter, lambda, to balance the weighting of transduction between hubs and authorities, but otherwise it can be reduced to the LabelSpreading algorithm with an altered S matrix in polynomial time (this requires computing the in-degree and out-degree of each node).
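As I read it, such an altered S matrix can be sketched roughly as follows (the names and normalization details here are my assumptions, not taken verbatim from the paper): two vertices are "hub-similar" if they link to common vertices and "authority-similar" if they are linked from common vertices, with lambda trading the two off.

```python
import numpy as np

def hub_authority_S(A, lam=0.5):
    """Hedged sketch of a symmetric similarity matrix for a directed graph,
    in the spirit of Zhou, Hofmann & Schoelkopf (2005)."""
    A = np.asarray(A, dtype=float)            # dense directed adjacency matrix
    d_in = np.maximum(A.sum(axis=0), 1e-12)   # in-degrees
    d_out = np.maximum(A.sum(axis=1), 1e-12)  # out-degrees

    W_hub = A @ np.diag(1.0 / d_in) @ A.T     # co-linking, weighted by 1/d_in
    W_auth = A.T @ np.diag(1.0 / d_out) @ A   # co-linked, weighted by 1/d_out
    W = lam * W_hub + (1.0 - lam) * W_auth    # lambda trades hubs vs. authorities

    d = np.maximum(W.sum(axis=1), 1e-12)
    D = np.diag(1.0 / np.sqrt(d))
    return D @ W @ D                          # symmetric normalisation, as in LabelSpreading
```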

I'll add more references as and when I find them.

@piccolbo commented Sep 12, 2017 via email
