[BUG] Label propagation sometimes produces label_distributions that contain NaN. #9292

Closed
alpapado opened this issue Jul 7, 2017 · 17 comments · Fixed by #19271
@alpapado commented Jul 7, 2017

Description

An "invalid value encountered in true_divide" error is raised when calling fit on LabelSpreading.

After convergence, the label distribution for some samples is all zero, so the variable normalizer in label_propagation.py:291 contains zero values, causing the division self.label_distributions_ /= normalizer to produce NaN.

Maybe there is a connection to #8008? On other datasets, increasing the n_neighbors parameter to a value larger than the default made the issue disappear.
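
For illustration, here is a minimal sketch of the failure mode (not part of the original report; the array below is made up): a sample whose row in label_distributions_ is all zero gets a zero normalizer, and the row-wise division produces NaN.

import numpy as np

# One sample with label mass, one whose row is all zero after convergence.
label_distributions = np.array([[0.7, 0.3],
                                [0.0, 0.0]])
# Mirrors the row-wise normalization in label_propagation.py
normalizer = np.sum(label_distributions, axis=1)[:, np.newaxis]
with np.errstate(invalid='ignore'):   # 0/0 happens in the second row
    label_distributions /= normalizer
print(label_distributions)
# [[ 0.7  0.3]
#  [ nan  nan]]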

Steps/Code to Reproduce

from sklearn.datasets import fetch_mldata
from sklearn.semi_supervised import label_propagation
import numpy

# Turn floating-point warnings into hard errors so the bad division is caught
numpy.seterr(all='raise')

mnist = fetch_mldata('MNIST original', data_home="./tmp")

X = mnist.data[1:10000]
y = mnist.target[1:10000]

# Use only 300 labeled examples; -1 marks a sample as unlabeled
y[300:] = -1

lp_model = label_propagation.LabelSpreading(kernel='knn', n_neighbors=7, n_jobs=-1)
lp_model.fit(X, y)

Expected Results

No error is thrown.

Actual Results

  File "reproduce.py", line 16, in <module>
    lp_model.fit(X,y)
  File "...anaconda3/envs/ssl-py3/lib/python3.6/site-packages/sklearn/semi_supervised/label_propagation.py", line 291, in fit
    self.label_distributions_ /= normalizer
FloatingPointError: invalid value encountered in true_divide

Versions

[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
NumPy 1.13.0
SciPy 0.19.0
Scikit-Learn 0.19.dev0

@jnothman (Member) commented Jul 8, 2017 via email

@musically-ut (Contributor) commented Jul 8, 2017

The knn kernel is a bit suspect because of the directed edges it produces, and a fix is in the works.

Can this be reproduced with the rbf kernel as well?

@alpapado (Author) commented Jul 8, 2017

Yes, I believe it is on the current master; that is, it contains the latest pull request regarding the label_propagation module.

In the above code, changing the kernel to rbf produces a different, although similar, error:

Traceback (most recent call last):
  File "reproduce.py", line 21, in <module>
    lp_model.fit(X,y)
  File "/share/mug/gentoo/anaconda3/envs/ssl-py3/lib/python3.6/site-packages/sklearn/semi_supervised/label_propagation.py", line 234, in fit
    graph_matrix = self._build_graph()
  File "/share/mug/gentoo/anaconda3/envs/ssl-py3/lib/python3.6/site-packages/sklearn/semi_supervised/label_propagation.py", line 511, in _build_graph
    affinity_matrix = self._get_kernel(self.X_)
  File "/share/mug/gentoo/anaconda3/envs/ssl-py3/lib/python3.6/site-packages/sklearn/semi_supervised/label_propagation.py", line 131, in _get_kernel
    return rbf_kernel(X, X, gamma=self.gamma)
  File "/share/mug/gentoo/anaconda3/envs/ssl-py3/lib/python3.6/site-packages/sklearn/metrics/pairwise.py", line 837, in rbf_kernel
    np.exp(K, K)    # exponentiate K in-place
FloatingPointError: underflow encountered in exp

@musically-ut (Contributor)

Changing gamma should circumvent the underflow in the case of the rbf kernel. We do need more tests for numerical stability in both cases.
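
For instance (a small illustration with made-up data, not part of the original comment): two far-apart points underflow with a large gamma, while a much smaller gamma keeps the exponent representable.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

np.seterr(all='raise')
X = np.array([[0.0], [1000.0]])   # squared distance of 1e6

try:
    rbf_kernel(X, X, gamma=1.0)   # exp(-1e6) underflows
except FloatingPointError as e:
    print(e)                      # underflow encountered in exp

K = rbf_kernel(X, X, gamma=1e-7)  # exp(-0.1) is representable
print(K)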

jnothman added the Bug label Jul 8, 2017
@jnothman (Member) commented Jul 8, 2017

How might we avoid that underflow while calculating the kernel?

@musically-ut (Contributor)

I think detecting underflows in the kernel is a separate issue. Is there an example of any method in scikit-learn which detects under/overflows?

In this particular case, I think underflows can safely be set to zero; it would just mean that certain nodes are not connected to each other. We can use numpy.seterr to "ignore" underflows before the call to the kernel and reset the default behavior immediately afterwards.

Does that sound reasonable?
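
A sketch of that proposal, using np.errstate (the context-manager form of np.seterr, which restores the previous error state on exit):

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

np.seterr(all='raise')            # strict everywhere else
X = np.array([[0.0], [1000.0]])

# Underflowed affinities flush to zero inside the block, which simply
# means the corresponding nodes are not connected to each other.
with np.errstate(under='ignore'):
    affinity_matrix = rbf_kernel(X, X, gamma=1.0)
print(affinity_matrix)            # off-diagonal entries are exactly 0.0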

@jnothman (Member)

There appear to be at least a couple of cases where we explicitly ignore underflow.

@jnothman (Member)

Yes, your solution to underflow seems reasonable. Not the same as this issue, though.

@musically-ut (Contributor)

Hmm, I've been thinking about edge cases here.

If we end up with a connected component which does not have any labeled nodes within it, then the normalization will produce NaNs, since the corresponding rows of label_distributions_ will contain zeros irrespective of how many iterations we run the algorithm for (there are no nodes which can transduce any labels).

I think it is reasonable to give the output 1/n probability for all classes, but I am not sure how predict should break ties. Are there any examples of such tie-breaking?
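
To make the edge case concrete, here is a sketch (hypothetical data, assuming a version affected by this bug) where a small n_neighbors leaves one cluster with no labeled nodes in its connected component:

import numpy as np
from sklearn.semi_supervised import label_propagation

rng = np.random.RandomState(0)

# Two well-separated clusters; with n_neighbors=3 the knn graph keeps
# them in separate connected components.
cluster_a = rng.randn(20, 2)           # contains all the labeled samples
cluster_b = rng.randn(20, 2) + 100.0   # entirely unlabeled
X = np.vstack([cluster_a, cluster_b])

y = np.full(40, -1)                    # -1 marks unlabeled samples
y[:5] = 0
y[5:10] = 1

lp_model = label_propagation.LabelSpreading(kernel='knn', n_neighbors=3)
lp_model.fit(X, y)
# The rows for cluster_b received no label mass, so normalization yields NaN.
print(np.isnan(lp_model.label_distributions_).any())   # True on affected versions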

@alpapado (Author)

> If we end up with a connected component which does not have any labeled nodes within it, then the normalization will produce NaNs, since the corresponding rows of label_distributions_ will contain zeros irrespective of how many iterations we run the algorithm for (there are no nodes which can transduce any labels).

I believe this is exactly the issue I am facing. It agrees with the observation that increasing the n_neighbors parameter to a large enough value prevents the error from occurring, which makes sense: increasing n_neighbors decreases the chance of ending up with a node that is not connected to any labeled node.

@irenelizeth

I have replicated this issue when instantiating the LabelSpreading model with the default parameter values, i.e., LabelSpreading(). When I instantiate it with LabelSpreading(gamma=0.25, max_iter=5), the error is not thrown. Even instantiating LabelSpreading with gamma=0, max_iter=1 works fine; only leaving those parameters at their defaults produces the issue:

label_propagation.py:293: RuntimeWarning: invalid value encountered in divide
  self.label_distributions_ /= normalizer

@musically-ut (Contributor)

Since some other people have also faced this issue (personal communication), I think it would be best to take the following steps (sketched in code below):

  1. Produce a warning if np.any(normalizer == 0) for either LabelPropagation or LabelSpreading, telling the user to either increase n_neighbors (for the knn kernel) or decrease gamma (for the rbf kernel).
  2. Replace label_distributions_[idx, :] with 1/n whenever normalizer[idx] == 0.
  3. During prediction, select the first class arbitrarily (the default behavior of np.argmax).

How does that sound?
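
A hypothetical sketch of steps 1-3 (the helper name and warning message are illustrative, not the eventual scikit-learn implementation; assumes a float array):

import warnings
import numpy as np

def normalize_label_distributions(label_distributions, n_classes):
    # Step 1: warn when some rows received no label mass at all.
    normalizer = np.sum(label_distributions, axis=1)[:, np.newaxis]
    zero_rows = (normalizer == 0).ravel()
    if zero_rows.any():
        warnings.warn("Some samples received no label mass; try increasing "
                      "n_neighbors (knn kernel) or decreasing gamma (rbf kernel).")
        # Step 2: fall back to a uniform 1/n distribution for those rows.
        label_distributions[zero_rows] = 1.0 / n_classes
        normalizer[zero_rows] = 1.0
    label_distributions /= normalizer
    # Step 3 happens at predict time: np.argmax over a uniform row
    # picks the first class arbitrarily.
    return label_distributions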

@jnothman (Member)

Is 1/n more reasonable than the empirical distribution of known labels?

@musically-ut (Contributor)

1/n_empirical does sound more reasonable, and it gives us a reasonable way of breaking ties in .predict: the most likely class will always be selected.

Any comments on whether to show a warning as above, or to throw an error (making the point moot) when np.any(normalizer == 0)?
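
A sketch of the empirical-prior variant (illustrative names and data; it uses the class frequencies among the labeled samples as the fallback row):

import numpy as np

# Class frequencies among the labeled samples (-1 marks unlabeled).
y = np.array([0, 0, 0, 1, 2, -1, -1])
labeled = y[y != -1]
classes, counts = np.unique(labeled, return_counts=True)
empirical_prior = counts / counts.sum()         # [0.6, 0.2, 0.2]

# Rows that received no label mass get the empirical prior instead of 1/n,
# so argmax in .predict deterministically picks the most frequent class.
fallback_rows = np.tile(empirical_prior, (2, 1))
print(fallback_rows.argmax(axis=1))             # [0 0]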

@jnothman (Member) commented Nov 7, 2018 via email

@ni-apurata

I'm also having serious problems with this bug. Has there been any progress towards fixing it or finding workarounds?

@ra312 commented Jul 29, 2020

I am just wondering if this issue has been fixed. Any updates? Thanks!
