
Label Spreading: knn Kernel Variant differs from Reference Paper #11784


Open
markusdosch opened this issue Aug 9, 2018 · 5 comments

Comments

@markusdosch

According to label_propagation.py#L487, the reference paper for the Label Spreading implementation is Zhou et al.: Learning with local and global consistency (2004).

But there are two things not in line with the paper when using the knn kernel variant for Label Spreading:

(A) Self-loops in affinity matrix

According to the paper, the adjacency matrix should have no self-loops (W_ii = 0). But looking at the relevant line in _get_kernel, L134 (called with y=None, see L516 for Label Spreading and L397 for Label Propagation), kneighbors_graph gets called with mode='connectivity', which results in include_self=True according to the docs.
Therefore, it would be more correct to call it with include_self=False, as the sketch below illustrates.
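A minimal sketch of the difference (using the standalone sklearn.neighbors.kneighbors_graph function rather than the exact call path inside label_propagation.py):

```python
# Sketch: how include_self controls self-loops in the affinity matrix W.
import numpy as np
from sklearn.neighbors import kneighbors_graph

X = np.random.RandomState(0).rand(5, 2)

W_loops = kneighbors_graph(X, n_neighbors=2, mode='connectivity',
                           include_self=True)
W_clean = kneighbors_graph(X, n_neighbors=2, mode='connectivity',
                           include_self=False)

print(W_loops.diagonal())  # [1. 1. 1. 1. 1.] -> W_ii = 1 (self-loops)
print(W_clean.diagonal())  # [0. 0. 0. 0. 0.] -> W_ii = 0, as in Zhou et al.
```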

Note that this change would not only modify the kneighbors_graph affinity matrix used for Label Spreading, but also the one for Label Propagation. Looking at the reference paper for Label Propagation, Zhu et al.: Learning from labeled and unlabeled data with label propagation (2002), they do not seem to require that there be no self-loops.
But in sklearn's other reference, Bengio et al.: In Semi-Supervised Learning (2006), mentioned here, they advocate forcing W_ii = 0 anyway, so I would use include_self=False for Label Propagation's kneighbors_graph as well.

(B) Laplacian Matrix Calculation

In label_propagation#L517, we see that sparse.csgraph.laplacian gets called. The default value for this method's use_out_degree parameter is False. This is not a problem if the affinity matrix is symmetric, but since knn graphs are usually not symmetric, use_out_degree=True should be used.
The issue becomes visible when retrieving the diagonal via sparse.csgraph.laplacian(..., normed=False, return_diag=True): with use_out_degree=False, the diagonal values do not match the k from kneighbors_graph (see the sketch below).
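A sketch of the degree mismatch on an asymmetric kNN graph (again built with the standalone kneighbors_graph function, not the estimator's internals):

```python
# Sketch: in-degree vs. out-degree diagonal of the graph Laplacian for an
# asymmetric kNN connectivity graph.
import numpy as np
from scipy.sparse import csgraph
from sklearn.neighbors import kneighbors_graph

X = np.random.RandomState(0).rand(20, 2)
W = kneighbors_graph(X, n_neighbors=3, mode='connectivity',
                     include_self=False)

# Default (use_out_degree=False): in-degrees, i.e. column sums. These vary
# per node, since a point can be a neighbor of arbitrarily many others.
_, d_in = csgraph.laplacian(W, normed=False, return_diag=True,
                            use_out_degree=False)
# use_out_degree=True: out-degrees, i.e. row sums. Every node has exactly
# k outgoing edges, so the diagonal is constant and equal to k.
_, d_out = csgraph.laplacian(W, normed=False, return_diag=True,
                             use_out_degree=True)

print(d_in)   # generally not all equal to 3
print(d_out)  # [3. 3. 3. ... 3.]
```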

Summary

To conform with the reference paper, we should

- call kneighbors_graph with include_self=False (point A above), and
- call sparse.csgraph.laplacian with use_out_degree=True (point B above).

If you agree with me, I can submit this small PR and fix the tests in case any break. 😊

@jnothman
Member
jnothman commented Aug 9, 2018 via email

@musically-ut
Contributor
musically-ut commented Aug 9, 2018

This would be a step in the right direction, but some of the core problems with the kNN kernel will still remain (see #8008). Also, the reference papers start with connectivity matrices and do not describe how to construct them per se; using kNNs to make them sparse is something I have not seen actively recommended in many places, with a notable exception:

Wang, Fei, and Changshui Zhang. "Label propagation through linear neighborhoods." IEEE Transactions on Knowledge and Data Engineering 20.1 (2008): 55-67.

I've been meaning to give the kNN kernels an overhaul for a while now, but each step counts. 👍

> whether they justify a backwards-incompatible change

I think as long as the examples and the test cases do not break, these changes can be considered bug fixes rather than feature additions. So, personally, I do not think that they will need a parameter for flow control.

@markusdosch
Author

> Also, the reference papers start with connectivity matrices and do not describe how to construct them per se; using kNNs to make them sparse is something I have not seen actively recommended in many places, with a notable exception

Just to add to that: Bengio et al.: In Semi-Supervised Learning (2006), one of the current reference papers, explicitly mentions a k-nearest neighbor matrix a couple of times. Though in one of the next sentences in the introduction, they assume W_ij to be given by a symmetric positive function. If I understand that correctly, that means they assume W to be symmetric, something that knn cannot directly deliver (one would have to symmetrize the graph afterwards, e.g. as in the sketch below).
It would have been great if they had elaborated on that a bit more 😅
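For illustration, a hedged sketch of two common ways to symmetrize a kNN connectivity matrix (neither is prescribed by the reference papers; the variable names are mine):

```python
# Sketch: symmetrizing an asymmetric kNN connectivity matrix W.
import numpy as np
from sklearn.neighbors import kneighbors_graph

X = np.random.RandomState(0).rand(20, 2)
W = kneighbors_graph(X, n_neighbors=3, mode='connectivity',
                     include_self=False)

W_or = W.maximum(W.T)   # "OR" rule: keep an edge if either direction exists
W_and = W.minimum(W.T)  # "AND" rule: keep an edge only for mutual neighbors

assert (W_or != W_or.T).nnz == 0   # both results are symmetric
assert (W_and != W_and.T).nnz == 0
```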

@hkoppen
hkoppen commented Dec 13, 2018

I suggest exposing mode: {'connectivity', 'distance'} as an additional parameter for kNN in order to get the unweighted or weighted kNN graph. Further, I suggest exposing self_loops as an additional parameter.

Maybe one could also explain in the documentation that n_neighbours=1 for kNN is useless, since it will (at the moment) create a kNN graph whose only edges are self-loops for every vertex (see the snippet below). So basically n_neighbours=k gives what I would call a (currently directed) "(k-1)-nearest neighbour" graph :D
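A quick demonstration of that remark (assuming the current include_self=True behaviour described in point A above):

```python
# Sketch: with self-loops included, n_neighbors=1 yields the identity
# matrix, because each point's single nearest neighbor is itself.
import numpy as np
from sklearn.neighbors import kneighbors_graph

X = np.random.RandomState(0).rand(5, 2)
W = kneighbors_graph(X, n_neighbors=1, mode='connectivity',
                     include_self=True)
print(W.toarray())  # 5x5 identity matrix -> no information can propagate
```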

@jnothman
Member
jnothman commented Dec 16, 2018 via email

@thomasjpfan removed the Needs Triage label Apr 14, 2022