NMF for Kullback-Leibler divergence #1348
Conversation
Now computes the real generalized_KL even for sparse matrices. Also fixes and extends tests.
- Fixes NMF random test that was not reliable enough.
- Adds flag to force parameter scaling in update when not in fit mode.
- Cosmetics.
- Better import in KL NMF test.
- Adds random state to NMF init params.
Hi @omangin. So basically this one minimizes the KL divergence while the current implementation minimizes MSE? Is there a qualitative difference between the results of this algorithm and the current implementation? Best,
Btw, if possible it might be nice to have an example to illustrate the difference between the two NMF methods.
Hi, thanks for your comments. This is the original implementation from Lee and Seung's Nature paper. It differs from the current implementation because the objective it optimizes is based on the Kullback-Leibler divergence rather than the Frobenius distance / MSE. It also differs in the nature of its updates: it uses multiplicative updates instead of the non-negative least squares (NNLS) used by the current implementation.

As far as I know, there are multiplicative and gradient updates for both the KL and Frobenius objectives, whereas NNLS is specific to the Frobenius case. From what I've heard (I will look back at the references when I have time), gradient-based approaches are considered more stable and tend to converge to better local minima. On the other hand, multiplicative updates have faster initial convergence and are somewhat simpler to implement because there is no need for step-size adaptation. However, this might change between the KL and Frobenius cases, and the computational cost also differs between the dense and sparse cases. The choice between the KL and Frobenius objectives is mainly problem dependent.

I've included an example on the faces dataset, but it is not really illustrative of the difference. Furthermore, I noticed that both methods are very sensitive to initialization. Also, the initialization methods are currently the same as in the previous implementation, but they might not be as relevant for the KL case as they were for the Frobenius case. Perhaps I can also look for a better illustrative example.

Best, Olivier
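For context, the multiplicative update rules from Lee and Seung's paper for the generalized KL divergence can be sketched in a few lines of NumPy. This is a minimal dense-data illustration of the rules being discussed, not the code of this PR (the function name and the `eps` smoothing are mine):

```python
import numpy as np

def kl_nmf_multiplicative(X, n_components, n_iter=200, eps=1e-10, random_state=0):
    # Minimal dense-data sketch of Lee & Seung's multiplicative updates
    # for the generalized KL divergence D(X || WH); not the code of this PR.
    rng = np.random.RandomState(random_state)
    n_samples, n_features = X.shape
    W = rng.rand(n_samples, n_components)
    H = rng.rand(n_components, n_features)
    for _ in range(n_iter):
        WH = W @ H + eps
        # H <- H * (W^T (X / WH)) / (W^T 1)
        H *= (W.T @ (X / WH)) / (W.sum(axis=0)[:, np.newaxis] + eps)
        WH = W @ H + eps
        # W <- W * ((X / WH) H^T) / (1 H^T)
        W *= ((X / WH) @ H.T) / (H.sum(axis=1)[np.newaxis, :] + eps)
    return W, H
```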
It is important to understand the tradeoffs and to explain them, with
That would be great! Thanks a lot.
As far as I know, the Lee and Seung paper doesn't have convergence guarantees. Both because of that and for integration with the existing code, I think it would be preferable to implement the KL divergence using the projected gradient method. This way you just need to replace the gradient computations in the algorithm that we already have (we could add a `loss` parameter).

The coordinate descent algorithm I mentioned in #896 is also pretty nice. It's basically like projected gradient except that it optimizes only one variable at a time. It selects variables greedily in the squared-loss case and cyclically in the KL-divergence case.
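To illustrate the "just replace the gradient" point, here is a rough sketch with a hypothetical `loss` parameter (the parameter name, values, and helper are illustrative, not settled in this thread):

```python
import numpy as np

def grad_H(X, W, H, loss="frobenius", eps=1e-10):
    # Sketch: only the gradient w.r.t. H changes between objectives,
    # the projected gradient machinery itself stays the same.
    WH = W @ H
    if loss == "frobenius":
        return W.T @ (WH - X)                # gradient of 0.5 * ||X - WH||_F^2
    if loss == "kullback-leibler":
        return W.T @ (1.0 - X / (WH + eps))  # gradient of D_KL(X || WH)
    raise ValueError("unknown loss: %r" % loss)
```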
One benefit of multiplicative updates is that they apply out of the box to complex-valued data, such as is used for audio and time series processing. It might be interesting to benchmark multiplicative updates against gradient / coordinate descent. Maybe there are some regimes in which multiplicative updates are faster.
From what I recall there was no clean-cut advantage, and we were thinking of implementing multiplicative updates too, at some point (note that the NMF class is just a wrapper around ProjectedGradientNMF because we thought ahead). I would suggest renaming the class in this PR to

Thanks for pinging me @ogrisel, I completely missed this thread.
+1 to @vene's suggestion, but
Please:
I had the same impression from the few comparisons I did in the past.
Seems interesting, I'll have a deeper look at it.
I thought about it. That would actually be a quite straightforward modification. The only issue I see is that the updates in the KL case can be optimized when the data matrix is sparse; it is not guaranteed that the same optimization is possible for the Frobenius case. Anyway, it is probably not an issue if the doc clearly explains that changing the 'loss' parameter can change the complexity of the underlying algorithm a lot.

About the convergence guarantees, as far as I know there is no definitive answer for the multiplicative updates. Some aspects are studied in http://www.csie.ntu.edu.tw/~cjlin/papers/pgradnmf.pdf for the square loss.

It is probably not a lot of work to include both update methods for both losses. However, the purpose of scikit-learn may not be to include every possible method...

Sorry for the superficial answers. I don't have a lot of time currently for deeper investigations. However, all these questions are very relevant for me, so I will definitely (re-)check my bibliography and bench a few things later.
Thanks for taking the comments into account @omangin. There is really no hurry.
We can also raise a
In that case I would support only projected gradient. As I said earlier, adding the KL divergence is just a matter of adding the gradient. All the rest should remain the same.
Would @omangin's sparse data optimization for the KL loss be transferable to projected gradient?
I think @omangin was referring to the fact that it might not be possible to handle the Frobenius norm efficiently in the case of multiplicative updates. I can't comment, as I'm only familiar with projected gradient. I was talking about adding the KL divergence to the projected gradient implementation we already have. I see no reason it cannot support sparse data efficiently.
The answer is yes. The optimization is actually just grounded on the fact that the multiplicative update for KL uses the element-wise division X / WH, X being the data matrix and WH the learnt factorization. In the case where X is sparse and big while W and H are not, it is very slow to compute (and sometimes not even possible to store in memory) the product WH, which by default is dense. Since only the element-wise division is needed by the algorithm, it can be computed only where the coefficients of X are nonzero. Both the multiplicative and the gradient updates only use the value of X / WH, so this optimization holds for both algorithms.
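A minimal sketch of that sparse trick with SciPy (the helper name is illustrative, not taken from the PR or the gist): the ratio X / (WH) is evaluated only at the nonzero coordinates of X, so the dense product WH is never materialized.

```python
import numpy as np
import scipy.sparse as sp

def sparse_elementwise_ratio(X, W, H, eps=1e-10):
    # Compute the element-wise ratio X / (WH) only at the nonzero
    # coordinates of X, without ever forming the dense product WH.
    X = sp.coo_matrix(X)
    # (WH)[i, j] = W[i, :] @ H[:, j], evaluated only where X[i, j] != 0
    wh_at_nonzeros = np.einsum("nk,kn->n", W[X.row], H[:, X.col])
    ratio = X.data / (wh_at_nonzeros + eps)
    return sp.coo_matrix((ratio, (X.row, X.col)), shape=X.shape)
```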
Ok great, thanks for the clarification. I have no opinion on whether or not we should include the multiplicative updates optimizer (we would probably require at least one example where it fares better in some way than the existing projected gradient method), but we should definitely add support for
I agree with @ogrisel :)
@omangin Could you create a gist with your code and add it to https://github.com/scikit-learn/scikit-learn/wiki/Third-party-projects-and-code-snippets?
As suggested by @mblondel, I added the code as a third-party code snippet. See the third-party projects page and the gist.
Thank you very much @omangin! I tweeted your gist https://twitter.com/mblondel_ml/status/430686434526130176
Hi, I am new to contributing and I am interested in NMF. Can someone point me to a good NMF algorithm that should be included? Please suggest a topic so that I can start early.
closing in favor of #5295 |
Adds the KLdivNMF class to decomposition.
The implemented method uses multiplicative updates. It is optimized to work with sparse matrices with lots of features.
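A hedged usage sketch of the class described above, assuming it follows the standard scikit-learn estimator API (`fit_transform` / `components_`); the exact signature is an assumption on my part, and the class lives on this PR branch / in the linked gist, not in released scikit-learn:

```python
import numpy as np
from sklearn.decomposition import KLdivNMF  # available on this branch only (assumed import path)

X = np.abs(np.random.RandomState(0).randn(100, 50))  # nonnegative toy data
model = KLdivNMF(n_components=10, random_state=0)    # random_state per the PR's init params
W = model.fit_transform(X)   # per-sample activations (assumed sklearn-style API)
H = model.components_        # learnt dictionary (assumed sklearn-style API)
```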