[MRG+1] Fix semi_supervised (#9239) · paulha/scikit-learn@e873053 · GitHub

Commit e873053

musically-ut authored and paulha committed
[MRG+1] Fix semi_supervised (scikit-learn#9239)
* Files for my dev environment with Docker
* Fixing label clamping (alpha=0 for hard clamping)
* Deprecating alpha, fixing its value to zero
* Correct way to deprecate alpha for LabelPropagation. The previous way was breaking the test sklearn.tests.test_common.test_all_estimators
* Detailed info for LabelSpreading's alpha parameter, based on the original paper
* Minor changes in the deprecation message
* Improving "deprecated" doc string and raising DeprecationWarning
* Using a local "alpha" in "fit" to deprecate LabelPropagation's alpha. This solution isn't great, but it sets the correct value for alpha without violating the restrictions imposed by the tests.
* Removal of my development files
* Using sphinx's "deprecated" tag (jnothman's suggestion)
* Deprecation warning: stating that the alpha's value will be ignored
* Use __init__ with alpha=None
* Update what's new
* Try fix RuntimeWarning in test_alpha_deprecation
* DOC Indent deprecation details
* DOC wording
* Update docs
* Change to the one true implementation.
* Add sanity-checked impl. of Label{Propagation,Spreading}
* Raise ValueError if alpha is invalid in LabelSpreading.
* Add a normalizing step before clamping to LabelPropagation.
* Fix flake8 errors.
* Remove duplicate imports.
* DOC Update What's New.
* Specify alpha's value in the error.
* Tidy up tests. Add a test and add references, where needed.
* Add comment to non-regression test.
* Fix documentation.
* Move check for alpha into fit from __init__.
* Fix corner case of LabelSpreading with alpha=None.
* alpha -> self.variant
* Make Whats_new more explicit.
* Simplify impl. of Label{Propagation,Spreading}.
* variant -> _variant.
1 parent 6e38e00 commit e873053

File tree

5 files changed (+160, -19 lines)


doc/modules/label_propagation.rst

Lines changed: 2 additions & 2 deletions
@@ -52,8 +52,8 @@ differ in modifications to the similarity matrix that graph and the
 clamping effect on the label distributions.
 Clamping allows the algorithm to change the weight of the true ground labeled
 data to some degree. The :class:`LabelPropagation` algorithm performs hard
-clamping of input labels, which means :math:`\alpha=1`. This clamping factor
-can be relaxed, to say :math:`\alpha=0.8`, which means that we will always
+clamping of input labels, which means :math:`\alpha=0`. This clamping factor
+can be relaxed, to say :math:`\alpha=0.2`, which means that we will always
 retain 80 percent of our original label distribution, but the algorithm gets to
 change its confidence of the distribution within 20 percent.
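In update form, the soft clamping described here is the per-iteration rule used by LabelSpreading. A sketch in the notation of Zhou et al. (2004), where S is the normalized graph matrix, \hat{Y}^{(t)} the evolving label distributions, and Y^{(0)} the initial one-hot labels (zero rows for unlabeled points):

    \hat{Y}^{(t+1)} = \alpha \, S \, \hat{Y}^{(t)} + (1 - \alpha) \, Y^{(0)}

With \alpha = 0.2 each labeled point therefore keeps 80 percent of its original distribution at every step; hard clamping (\alpha = 0, as LabelPropagation now does) instead resets the labeled rows to their original values after each propagation step.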

doc/whats_new.rst

Lines changed: 10 additions & 1 deletion
@@ -448,7 +448,16 @@ Bug fixes
   in :class:`decomposition.PCA`,
   :class:`decomposition.RandomizedPCA` and
   :class:`decomposition.IncrementalPCA`.
-  :issue:`9105` by `Hanmin Qin <https://github.com/qinhanmin2014>`_.
+  :issue:`9105` by `Hanmin Qin <https://github.com/qinhanmin2014>`_.
+
+- Fix :class:`semi_supervised.BaseLabelPropagation` to correctly implement
+  ``LabelPropagation`` and ``LabelSpreading`` as done in the referenced
+  papers. :class:`semi_supervised.LabelPropagation` now always does hard
+  clamping. Its ``alpha`` parameter has no effect and is
+  deprecated to be removed in 0.21. :issue:`6727` :issue:`3550` issue:`5770`
+  by :user:`Andre Ambrosio Boechat <boechat107>`, :user:`Utkarsh Upadhyay
+  <musically-ut>`, and `Joel Nothman`_.
+
 
 API changes summary
 -------------------

examples/semi_supervised/plot_label_propagation_structure.py

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@
 
 # #############################################################################
 # Learn with LabelSpreading
-label_spread = label_propagation.LabelSpreading(kernel='knn', alpha=1.0)
+label_spread = label_propagation.LabelSpreading(kernel='knn', alpha=0.2)
 label_spread.fit(X, labels)
 
 # #############################################################################
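The example moves off ``alpha=1.0`` because ``fit`` now rejects LabelSpreading values outside the open interval (0, 1) (see the ValueError added in label_propagation.py below). A minimal sketch of the new behaviour on made-up toy data, assuming the patched 0.19 code:

    import numpy as np
    from sklearn.semi_supervised import LabelSpreading

    X = np.array([[1., 0.], [0., 1.], [1., 2.5]])   # toy points for illustration
    y = np.array([0, 1, -1])                        # -1 marks the unlabeled sample

    LabelSpreading(kernel='rbf', alpha=0.2).fit(X, y)        # valid: alpha in (0, 1)
    try:
        LabelSpreading(kernel='rbf', alpha=1.0).fit(X, y)    # the old example value
    except ValueError as err:
        print(err)  # alpha=1.0 is invalid: it must be inside the open interval (0, 1)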

sklearn/semi_supervised/label_propagation.py

Lines changed: 61 additions & 15 deletions
@@ -14,11 +14,12 @@
 Model Features
 --------------
 Label clamping:
-  The algorithm tries to learn distributions of labels over the dataset. In the
-  "Hard Clamp" mode, the true ground labels are never allowed to change. They
-  are clamped into position. In the "Soft Clamp" mode, they are allowed some
-  wiggle room, but some alpha of their original value will always be retained.
-  Hard clamp is the same as soft clamping with alpha set to 1.
+  The algorithm tries to learn distributions of labels over the dataset given
+  label assignments over an initial subset. In one variant, the algorithm does
+  not allow for any errors in the initial assignment (hard-clamping) while
+  in another variant, the algorithm allows for some wiggle room for the initial
+  assignments, allowing them to change by a fraction alpha in each iteration
+  (soft-clamping).
 
 Kernel:
   A function which projects a vector into some higher dimensional space. This
@@ -55,6 +56,7 @@
 # License: BSD
 from abc import ABCMeta, abstractmethod
 
+import warnings
 import numpy as np
 from scipy import sparse
 
@@ -239,38 +241,55 @@ def fit(self, X, y):
 
         n_samples, n_classes = len(y), len(classes)
 
+        alpha = self.alpha
+        if self._variant == 'spreading' and \
+                (alpha is None or alpha <= 0.0 or alpha >= 1.0):
+            raise ValueError('alpha=%s is invalid: it must be inside '
+                             'the open interval (0, 1)' % alpha)
         y = np.asarray(y)
         unlabeled = y == -1
-        clamp_weights = np.ones((n_samples, 1))
-        clamp_weights[unlabeled, 0] = self.alpha
 
         # initialize distributions
         self.label_distributions_ = np.zeros((n_samples, n_classes))
         for label in classes:
             self.label_distributions_[y == label, classes == label] = 1
 
         y_static = np.copy(self.label_distributions_)
-        if self.alpha > 0.:
-            y_static *= 1 - self.alpha
-        y_static[unlabeled] = 0
+        if self._variant == 'propagation':
+            # LabelPropagation
+            y_static[unlabeled] = 0
+        else:
+            # LabelSpreading
+            y_static *= 1 - alpha
 
         l_previous = np.zeros((self.X_.shape[0], n_classes))
 
         remaining_iter = self.max_iter
+        unlabeled = unlabeled[:, np.newaxis]
         if sparse.isspmatrix(graph_matrix):
             graph_matrix = graph_matrix.tocsr()
         while (_not_converged(self.label_distributions_, l_previous, self.tol)
                 and remaining_iter > 1):
             l_previous = self.label_distributions_
             self.label_distributions_ = safe_sparse_dot(
                 graph_matrix, self.label_distributions_)
-            # clamp
-            self.label_distributions_ = np.multiply(
-                clamp_weights, self.label_distributions_) + y_static
+
+            if self._variant == 'propagation':
+                normalizer = np.sum(
+                    self.label_distributions_, axis=1)[:, np.newaxis]
+                self.label_distributions_ /= normalizer
+                self.label_distributions_ = np.where(unlabeled,
+                                                     self.label_distributions_,
+                                                     y_static)
+            else:
+                # clamp
+                self.label_distributions_ = np.multiply(
+                    alpha, self.label_distributions_) + y_static
             remaining_iter -= 1
 
         normalizer = np.sum(self.label_distributions_, axis=1)[:, np.newaxis]
         self.label_distributions_ /= normalizer
+
         # set the transduction item
         transduction = self.classes_[np.argmax(self.label_distributions_,
                                                axis=1)]
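Stripped of the estimator plumbing, the two branches above differ only in how an iteration combines the propagated distributions with the initial labels. A minimal NumPy sketch of one update step, assuming dense arrays and the same variable roles as in ``fit`` (the helper name ``one_iteration`` is made up for illustration):

    import numpy as np

    def one_iteration(variant, graph_matrix, label_distributions, y_static,
                      unlabeled, alpha):
        # y_static holds the initial one-hot labels; for the 'spreading'
        # variant it has already been scaled by (1 - alpha).
        label_distributions = graph_matrix.dot(label_distributions)
        if variant == 'propagation':
            # LabelPropagation: row-normalize, then hard-clamp labeled rows
            # back to their original one-hot values.
            label_distributions /= label_distributions.sum(axis=1, keepdims=True)
            return np.where(unlabeled[:, np.newaxis],
                            label_distributions, y_static)
        # LabelSpreading: soft clamp against the pre-scaled initial labels.
        return alpha * label_distributions + y_static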
@@ -299,7 +318,11 @@ class LabelPropagation(BaseLabelPropagation):
         Parameter for knn kernel
 
     alpha : float
-        Clamping factor
+        Clamping factor.
+
+        .. deprecated:: 0.19
+            This parameter will be removed in 0.21.
+            'alpha' is fixed to zero in 'LabelPropagation'.
 
     max_iter : float
         Change maximum number of iterations allowed
@@ -350,6 +373,14 @@ class LabelPropagation(BaseLabelPropagation):
     LabelSpreading : Alternate label propagation strategy more robust to noise
     """
 
+    _variant = 'propagation'
+
+    def __init__(self, kernel='rbf', gamma=20, n_neighbors=7,
+                 alpha=None, max_iter=30, tol=1e-3, n_jobs=1):
+        super(LabelPropagation, self).__init__(
+            kernel=kernel, gamma=gamma, n_neighbors=n_neighbors, alpha=alpha,
+            max_iter=max_iter, tol=tol, n_jobs=n_jobs)
+
     def _build_graph(self):
         """Matrix representing a fully connected graph between each sample
 
@@ -366,6 +397,15 @@ class distributions will exceed 1 (normalization may be desired).
         affinity_matrix /= normalizer[:, np.newaxis]
         return affinity_matrix
 
+    def fit(self, X, y):
+        if self.alpha is not None:
+            warnings.warn(
+                "alpha is deprecated since 0.19 and will be removed in 0.21.",
+                DeprecationWarning
+            )
+        self.alpha = None
+        return super(LabelPropagation, self).fit(X, y)
+
 
 class LabelSpreading(BaseLabelPropagation):
     """LabelSpreading model for semi-supervised learning
@@ -391,7 +431,11 @@ class LabelSpreading(BaseLabelPropagation):
         parameter for knn kernel
 
     alpha : float
-        clamping factor
+        Clamping factor. A value in [0, 1] that specifies the relative amount
+        that an instance should adopt the information from its neighbors as
+        opposed to its initial label.
+        alpha=0 means keeping the initial label information; alpha=1 means
+        replacing all initial information.
 
     max_iter : float
         maximum number of iterations allowed
@@ -446,6 +490,8 @@ class LabelSpreading(BaseLabelPropagation):
     LabelPropagation : Unregularized graph based semi-supervised learning
     """
 
+    _variant = 'spreading'
+
     def __init__(self, kernel='rbf', gamma=20, n_neighbors=7, alpha=0.2,
                  max_iter=30, tol=1e-3, n_jobs=1):
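Net effect of the LabelPropagation changes: ``alpha`` defaults to ``None``, any user-supplied value is ignored with a DeprecationWarning, and LabelSpreading keeps its ``alpha=0.2`` default. A short sketch of the resulting behaviour, assuming the patched 0.19 code (it mirrors ``test_alpha_deprecation`` in the test file below):

    import warnings
    import numpy as np
    from sklearn.semi_supervised import LabelPropagation

    X = np.array([[1., 0.], [0., 1.], [1., 2.5]])   # toy data for illustration
    y = np.array([0, 1, -1])

    LabelPropagation(kernel='rbf', gamma=0.1).fit(X, y)   # default alpha=None: no warning

    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        LabelPropagation(kernel='rbf', gamma=0.1, alpha=0.5).fit(X, y)
    assert any(issubclass(w.category, DeprecationWarning) for w in caught)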

sklearn/semi_supervised/tests/test_label_propagation.py

Lines changed: 86 additions & 0 deletions
@@ -3,8 +3,12 @@
 import numpy as np
 
 from sklearn.utils.testing import assert_equal
+from sklearn.utils.testing import assert_warns
+from sklearn.utils.testing import assert_raises
+from sklearn.utils.testing import assert_no_warnings
 from sklearn.semi_supervised import label_propagation
 from sklearn.metrics.pairwise import rbf_kernel
+from sklearn.datasets import make_classification
 from numpy.testing import assert_array_almost_equal
 from numpy.testing import assert_array_equal
 
@@ -59,3 +63,85 @@ def test_predict_proba():
         clf = estimator(**parameters).fit(samples, labels)
         assert_array_almost_equal(clf.predict_proba([[1., 1.]]),
                                   np.array([[0.5, 0.5]]))
+
+
+def test_alpha_deprecation():
+    X, y = make_classification(n_samples=100)
+    y[::3] = -1
+
+    lp_default = label_propagation.LabelPropagation(kernel='rbf', gamma=0.1)
+    lp_default_y = assert_no_warnings(lp_default.fit, X, y).transduction_
+
+    lp_0 = label_propagation.LabelPropagation(alpha=0, kernel='rbf', gamma=0.1)
+    lp_0_y = assert_warns(DeprecationWarning, lp_0.fit, X, y).transduction_
+
+    assert_array_equal(lp_default_y, lp_0_y)
+
+
+def test_label_spreading_closed_form():
+    n_classes = 2
+    X, y = make_classification(n_classes=n_classes, n_samples=200,
+                               random_state=0)
+    y[::3] = -1
+    clf = label_propagation.LabelSpreading().fit(X, y)
+    # adopting notation from Zhou et al (2004):
+    S = clf._build_graph()
+    Y = np.zeros((len(y), n_classes + 1))
+    Y[np.arange(len(y)), y] = 1
+    Y = Y[:, :-1]
+    for alpha in [0.1, 0.3, 0.5, 0.7, 0.9]:
+        expected = np.dot(np.linalg.inv(np.eye(len(S)) - alpha * S), Y)
+        expected /= expected.sum(axis=1)[:, np.newaxis]
+        clf = label_propagation.LabelSpreading(max_iter=10000, alpha=alpha)
+        clf.fit(X, y)
+        assert_array_almost_equal(expected, clf.label_distributions_, 4)
+
+
+def test_label_propagation_closed_form():
+    n_classes = 2
+    X, y = make_classification(n_classes=n_classes, n_samples=200,
+                               random_state=0)
+    y[::3] = -1
+    Y = np.zeros((len(y), n_classes + 1))
+    Y[np.arange(len(y)), y] = 1
+    unlabelled_idx = Y[:, (-1,)].nonzero()[0]
+    labelled_idx = (Y[:, (-1,)] == 0).nonzero()[0]
+
+    clf = label_propagation.LabelPropagation(max_iter=10000,
+                                             gamma=0.1).fit(X, y)
+    # adopting notation from Zhu et al 2002
+    T_bar = clf._build_graph()
+    Tuu = T_bar[np.meshgrid(unlabelled_idx, unlabelled_idx, indexing='ij')]
+    Tul = T_bar[np.meshgrid(unlabelled_idx, labelled_idx, indexing='ij')]
+    Y = Y[:, :-1]
+    Y_l = Y[labelled_idx, :]
+    Y_u = np.dot(np.dot(np.linalg.inv(np.eye(Tuu.shape[0]) - Tuu), Tul), Y_l)
+
+    expected = Y.copy()
+    expected[unlabelled_idx, :] = Y_u
+    expected /= expected.sum(axis=1)[:, np.newaxis]
+
+    assert_array_almost_equal(expected, clf.label_distributions_, 4)
+
+
+def test_valid_alpha():
+    n_classes = 2
+    X, y = make_classification(n_classes=n_classes, n_samples=200,
+                               random_state=0)
+    for alpha in [-0.1, 0, 1, 1.1, None]:
+        assert_raises(ValueError,
+                      lambda **kwargs:
+                      label_propagation.LabelSpreading(**kwargs).fit(X, y),
+                      alpha=alpha)
+
+
+def test_convergence_speed():
+    # This is a non-regression test for #5774
+    X = np.array([[1., 0.], [0., 1.], [1., 2.5]])
+    y = np.array([0, 1, -1])
+    mdl = label_propagation.LabelSpreading(kernel='rbf', max_iter=5000)
+    mdl.fit(X, y)
+
+    # this should converge quickly:
+    assert mdl.n_iter_ < 10
+    assert_array_equal(mdl.predict(X), [0, 1, 1])
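The two closed-form tests check the iterative solver against the fixed points given in the referenced papers, both compared to ``label_distributions_`` after row normalization. A sketch of the identities being verified, in the papers' notation:

    F^{*} \propto (I - \alpha S)^{-1} Y                      % LabelSpreading, Zhou et al. (2004)
    f_{U} = (I - \bar{T}_{UU})^{-1} \bar{T}_{UL} Y_{L}       % LabelPropagation, Zhu & Ghahramani (2002)

Here S and \bar{T} are the normalized graph matrices returned by ``_build_graph``, Y is the initial label matrix, and in the propagation case the labeled rows stay fixed at Y_{L}.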

0 commit comments