
Label Spreading: knn Kernel Variant differs from Reference Paper #11784


Open
markusdosch opened this issue Aug 9, 2018 · 5 comments

Comments

@markusdosch

According to label_propagation.py#L487, the reference paper for the Label Spreading implementation is Zhou et al.: Learning with local and global consistency (2004).

But there are two things not in line with the paper when using the knn kernel variant for Label Spreading:

(A) Self-loops in affinity matrix

According to the paper, the adjacency matrix should have no self-loops (W_ii = 0). But looking at the relevant line in _get_kernel, L134 (called with y=None, see L516 for Label Spreading and L397 for Label Propagation), kneighbors_graph gets called with mode='connectivity', which results in include_self=True according to the docs.
Therefore, it would be more correct to call it with include_self=False, as the sketch below illustrates.
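A minimal sketch of the difference (using the standalone sklearn.neighbors.kneighbors_graph function rather than the exact call path inside label_propagation.py):

```python
# Sketch: how include_self controls self-loops in the affinity matrix W.
import numpy as np
from sklearn.neighbors import kneighbors_graph

X = np.random.RandomState(0).rand(5, 2)

W_loops = kneighbors_graph(X, n_neighbors=2, mode='connectivity',
                           include_self=True)
W_clean = kneighbors_graph(X, n_neighbors=2, mode='connectivity',
                           include_self=False)

print(W_loops.diagonal())  # [1. 1. 1. 1. 1.] -> W_ii = 1 (self-loops)
print(W_clean.diagonal())  # [0. 0. 0. 0. 0.] -> W_ii = 0, as in Zhou et al.
```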

Note that this change would not only modify the kneighbors_graph affinity matrix used for Label Spreading, but also the one for Label Propagation. Looking at the reference paper for Label Propagation, Zhu et al.: Learning from labeled and unlabeled data with label propagation (2002), they do not seem to require that there be no self-loops.
But in sklearn's other reference, Bengio et al.: In Semi-Supervised Learning (2006), mentioned here, they advocate forcing W_ii = 0 anyway, so I would use include_self=False for Label Propagation's kneighbors_graph as well.

(B) Laplacian Matrix Calculation

In label_propagation#L517, we see that sparse.csgraph.laplacian gets called. The default value for this method's use_out_degree parameter is False. This is not a problem if the affinity matrix is symmetric, but since knn graphs are usually not symmetric, use_out_degree=True should be used.
The issue becomes visible when retrieving the diagonal via sparse.csgraph.laplacian(..., normed=False, return_diag=True): with use_out_degree=False, the diagonal values do not match the k from kneighbors_graph (see the sketch below).
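A sketch of the degree mismatch on an asymmetric kNN graph (again built with the standalone kneighbors_graph function, not the estimator's internals):

```python
# Sketch: in-degree vs. out-degree diagonal of the graph Laplacian for an
# asymmetric kNN connectivity graph.
import numpy as np
from scipy.sparse import csgraph
from sklearn.neighbors import kneighbors_graph

X = np.random.RandomState(0).rand(20, 2)
W = kneighbors_graph(X, n_neighbors=3, mode='connectivity',
                     include_self=False)

# Default (use_out_degree=False): in-degrees, i.e. column sums. These vary
# per node, since a point can be a neighbor of arbitrarily many others.
_, d_in = csgraph.laplacian(W, normed=False, return_diag=True,
                            use_out_degree=False)
# use_out_degree=True: out-degrees, i.e. row sums. Every node has exactly
# k outgoing edges, so the diagonal is constant and equal to k.
_, d_out = csgraph.laplacian(W, normed=False, return_diag=True,
                             use_out_degree=True)

print(d_in)   # generally not all equal to 3
print(d_out)  # [3. 3. 3. ... 3.]
```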

Summary

To conform with the reference paper, we should

- call kneighbors_graph with include_self=False (point A above), and
- call sparse.csgraph.laplacian with use_out_degree=True (point B above).

If you agree with me, I can submit this small PR and fix the tests in case any break. 😊

@jnothman
Member
jnothman commented Aug 9, 2018 via email

@musically-ut
Contributor
musically-ut commented Aug 9, 2018

This would be a step in the right direction, but some of the core problems with the kNN kernel will still remain (see #8008). Also, the reference papers start with connectivity matrices and do not describe how to construct them per se; using kNNs to make them sparse is something I have not seen actively recommended in many places, with a notable exception:

Wang, Fei, and Changshui Zhang. "Label propagation through linear neighborhoods." IEEE Transactions on Knowledge and Data Engineering 20.1 (2008): 55-67.

I've been meaning to give the kNN kernels an overhaul for a while now, but each step counts. 👍

> whether they justify a backwards-incompatible change

I think as long as the examples and the test cases do not break, these changes can be considered bug fixes rather than feature additions. So, personally, I do not think that they will need a parameter for flow control.

@markusdosch
Author

> Also, the reference papers start with connectivity matrices and do not describe how to construct them per se; using kNNs to make them sparse is something I have not seen actively recommended in many places, with a notable exception

Just to add to that: Bengio et al.: In Semi-Supervised Learning (2006), one of the current reference papers, explicitly mentions a k-nearest neighbor matrix a couple of times. Though in one of the next sentences in the introduction, they assume W_ij to be given by a symmetric positive function. If I understand that correctly, that means they assume W to be symmetric, something that knn cannot directly deliver (one would have to symmetrize the graph afterwards, e.g. as in the sketch below).
It would have been great if they had elaborated on that a bit more 😅
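For illustration, a hedged sketch of two common ways to symmetrize a kNN connectivity matrix (neither is prescribed by the reference papers; the variable names are mine):

```python
# Sketch: symmetrizing an asymmetric kNN connectivity matrix W.
import numpy as np
from sklearn.neighbors import kneighbors_graph

X = np.random.RandomState(0).rand(20, 2)
W = kneighbors_graph(X, n_neighbors=3, mode='connectivity',
                     include_self=False)

W_or = W.maximum(W.T)   # "OR" rule: keep an edge if either direction exists
W_and = W.minimum(W.T)  # "AND" rule: keep an edge only for mutual neighbors

assert (W_or != W_or.T).nnz == 0   # both results are symmetric
assert (W_and != W_and.T).nnz == 0
```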

@hkoppen
hkoppen commented Dec 13, 2018

I suggest exposing mode: {'connectivity', 'distance'} as an additional parameter for kNN in order to get the unweighted or weighted kNN graph. Further, I suggest exposing self_loops as an additional parameter.

Maybe one could also explain in the documentation that n_neighbours=1 for kNN is useless, since it will (at the moment) create a kNN graph whose only edges are self-loops for every vertex (see the snippet below). So basically n_neighbours=k gives what I would call a (currently directed) "(k-1)-nearest neighbour" graph :D
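A quick demonstration of that remark (assuming the current include_self=True behaviour described in point A above):

```python
# Sketch: with self-loops included, n_neighbors=1 yields the identity
# matrix, because each point's single nearest neighbor is itself.
import numpy as np
from sklearn.neighbors import kneighbors_graph

X = np.random.RandomState(0).rand(5, 2)
W = kneighbors_graph(X, n_neighbors=1, mode='connectivity',
                     include_self=True)
print(W.toarray())  # 5x5 identity matrix -> no information can propagate
```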

@jnothman
Member
jnothman commented Dec 16, 2018 via email

@thomasjpfan removed the Needs Triage label Apr 14, 2022