Label Spreading: knn Kernel Variant differs from Reference Paper #11784
These sound reasonable... I think we need to work out whether they justify
a backwards-incompatible change, or whether we should use some parameter
setting or similar to phase it in.
Ping @musically-ut
This would be a step in the right direction, but some of the core problems with the kNN kernel will still remain (see #8008). Also, the reference papers start with connectivity matrices and do not describe how to construct them per se; using kNN to make them sparse is something I have not seen being actively recommended in many places, with a notable exception:
I've been meaning to give the kNN kernels an overhaul for a while now, but each step counts. 👍
I think as long as the examples and the test cases do not break, these changes can be considered bug fixes rather than feature additions. So, personally, I do not think they will need a parameter for flow control.
Just to add to that: Bengio et al.: In Semi-Supervised Learning (2006), one of the current reference papers, explicitly mentions a k-nearest-neighbor matrix a couple of times. However, in one of the next sentences in the introduction, they assume W_ij to be given by a symmetric positive function. If I understand that correctly, they assume W to be symmetric, something that kNN cannot directly deliver.
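The asymmetry mentioned above is easy to see on toy data: a directed kNN connectivity graph is generally not equal to its transpose. The example below is a minimal sketch (the three-point dataset and the max-based symmetrization are my own illustration, not what scikit-learn currently does):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

# Three points on a line: 0 and 1 are close, 2 is far away.
X = np.array([[0.0], [1.0], [10.0]])

# Directed 1-NN connectivity graph (self-loops excluded).
A = kneighbors_graph(X, n_neighbors=1, mode='connectivity',
                     include_self=False).toarray()

# Point 2's nearest neighbour is 1, but 1's nearest neighbour is 0,
# so the edge 2 -> 1 has no reverse edge: A is not symmetric.
print(np.array_equal(A, A.T))  # False

# One common way to obtain a symmetric W (an assumption here, not
# scikit-learn's behavior): keep an edge if it exists in either direction.
W = np.maximum(A, A.T)
print(np.array_equal(W, W.T))  # True
```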
I suggest adding mode : {'connectivity', 'distance'} as an additional parameter for kNN in order to get the (un)weighted kNN graph. Further, I suggest adding a self_loops parameter. Maybe one could also explain in the documentation that n_neighbors=1 for kNN is useless, since it will (at the moment) create a kNN graph whose only edges are self-loops for every vertex. So basically n_neighbors=k is what I would call a (currently directed) "(k-1)-nearest-neighbour" graph :D
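The n_neighbors=1 pitfall described above can be sketched in a few lines: with self-loops included, each sample's single nearest neighbour is itself, so the graph degenerates to the identity matrix (the toy data below is mine, chosen only for illustration):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

X = np.array([[0.0], [1.0], [2.0]])

# With self-loops included, each point's single nearest neighbour
# is itself at distance 0: the graph is just the identity matrix.
A_self = kneighbors_graph(X, n_neighbors=1, mode='connectivity',
                          include_self=True).toarray()
print(np.array_equal(A_self, np.eye(3)))  # True

# With include_self=False, each point connects to its nearest
# *other* point and the diagonal stays zero (W_ii = 0).
A = kneighbors_graph(X, n_neighbors=1, mode='connectivity',
                     include_self=False).toarray()
print(A.diagonal().sum())  # 0.0
```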
Improvements to the documentation in the first instance are very welcome.
According to label_propagation.py#L487, the reference paper for the Label Spreading implementation is Zhou et al.: Learning with local and global consistency (2004).
But there are two things not in line with the paper when using the knn kernel variant for Label Spreading:
(A) Self-loops in affinity matrix
According to the paper, the adjacency matrix should have no self-loops (W_ii = 0). But taking a look at the relevant line in _get_kernel, L134 (called with `y=None`, see L516 for Label Spreading and L397 for Label Propagation), `kneighbors_graph` gets called with `mode='connectivity'`, which results in `include_self=True`, according to the docs. Therefore, a more correct way would be to call it with `include_self=False`.

Note that this change would modify not only the kneighbors_graph affinity matrix used for Label Spreading, but also the one for Label Propagation. The reference paper for Label Propagation, Zhu et al.: Learning from labeled and unlabeled data with label propagation (2002), does not seem to require that there be no self-loops. But sklearn's other reference, Bengio et al.: In Semi-Supervised Learning (2006), mentioned here, advocates forcing W_ii = 0 anyway, so I would use `include_self=False` for Label Propagation's kneighbors_graph as well.

(B) Laplacian Matrix Calculation
In label_propagation#L517, we see that `sparse.csgraph.laplacian` gets called. The default value for this method's `use_out_degree` parameter is `False`. This is not a problem if the affinity matrix is symmetric, but since knn graphs usually are not symmetric, `use_out_degree=True` should be used.

One can see this issue better by getting the diagonal via `sparse.csgraph.laplacian(..., normed=False, return_diag=True)`: when `use_out_degree=False`, the diagonal values won't resemble the `k` from `kneighbors_graph`.

Summary
To conform with the reference paper, we should

- call `kneighbors_graph` with `include_self=False`
- call `sparse.csgraph.laplacian` with `use_out_degree=True`
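The diagonal discrepancy from point (B) can be shown on a small directed kNN graph. This is a minimal sketch (the three-point dataset is my own; with k=1, every row of the graph sums to k, while column sums vary per node):

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.neighbors import kneighbors_graph

# Points placed so that the 1-NN graph is asymmetric:
# out-degree is always k = 1, but in-degrees differ per node.
X = np.array([[0.0], [1.0], [10.0]])
A = kneighbors_graph(X, n_neighbors=1, mode='connectivity',
                     include_self=False)

# Default: the diagonal is built from in-degrees (column sums),
# which generally does not equal k for a directed kNN graph.
_, d_in = laplacian(A, normed=False, return_diag=True)

# With use_out_degree=True the diagonal is the row sums,
# i.e. exactly k for every sample.
_, d_out = laplacian(A, normed=False, return_diag=True,
                     use_out_degree=True)

print(d_in)   # [1. 2. 0.] -- varies per node
print(d_out)  # [1. 1. 1.] -- always k
```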
If you agree with me, I can submit this small PR and fix the tests, in case one breaks. 😊