NMF for Kullback-Leibler divergence by omangin · Pull Request #1348 · scikit-learn/scikit-learn · GitHub

NMF for Kullback-Leibler divergence #1348


Closed · wants to merge 13 commits

Conversation

@omangin commented Nov 9, 2012

Adds the KLdivNMF class to decomposition.
The implemented method uses multiplicative updates. It is optimized to work with sparse matrices with many features.

"""Non negative factorization with Kullback Leibler divergence cost.

Parameters
----------
X: {array-like, sparse matrix}, shape = [n_samples, n_features]
    Data the model will be fit to.

n_components: int or None
    Number of components, if n_components is not set all components
    are kept

init:  'nndsvd' |  'nndsvda' | 'nndsvdar' | 'random'
    Method used to initialize the procedure.
    Default: 'nndsvdar' if n_components < n_features, otherwise random.
    Valid options::

        'nndsvd': Nonnegative Double Singular Value Decomposition (NNDSVD)
            initialization (better for sparseness)
        'nndsvda': NNDSVD with zeros filled with the average of X
            (better when sparsity is not desired)
        'nndsvdar': NNDSVD with zeros filled with small random values
            (generally faster, less accurate alternative to NNDSVDa
            for when sparsity is not desired)
        'random': non-negative random matrices

tol: double, default: 1e-4
    Tolerance value used in stopping conditions.

max_iter: int, default: 200
    Number of iterations to compute.

subit: int, default: 10
    Number of sub-iterations to perform on W (resp. H) before switching
    to H (resp. W) update.

random_state: int or RandomState
    Random number generator seed control.

Attributes
----------
`components_` : array, [n_components, n_features]
    Non-negative components of the data

Examples
--------

>>> import numpy as np
>>> X = np.array([[1,1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]])
>>> from sklearn.decomposition import KLdivNMF
>>> model = KLdivNMF(n_components=2, init='random', random_state=0)
>>> model.fit(X) #doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
KLdivNMF(eps=1e-08, init='random', max_iter=200, n_components=2,
        random_state=0, subit=10, tol=1e-06)
>>> model.components_
array([[ 0.50303234,  0.49696766],
       [ 0.93326505,  0.06673495]])

Notes
-----
This implements

Lee D. D., Seung H. S., Learning the parts of objects by non-negative
  matrix factorization. Nature, 1999
"""

@amueller (Member) commented Nov 9, 2012

Hi @omangin.
Thanks for this contribution. This looks quite impressive. Can you say how it compares against the current implementation? I thought the additive updates are considered more stable than the multiplicative ones, but I'm not up to date. You implemented the original method from the Nature paper, right?

So basically this one minimizes KL-Divergence while the current implementation minimizes MSE?
And they use different update schemes. I know there are also multiplicative updates for MSE, but I think we don't have that (right?).

Is there a qualitative difference between the results of this algorithm and the current implementation?

Best,
Andy :)

@amueller (Member) commented Nov 9, 2012

Btw, if possible it might be nice to have an example to illustrate the difference between the two NMF methods.
I would guess that there is no visible difference in the faces example?

@omangin (Author) commented Nov 9, 2012

Hi,

Thanks for your comments.

This is the original implementation from Lee and Seung's Nature paper. It differs from the current implementation both in the objective it optimizes, which is based on the Kullback-Leibler divergence rather than the Frobenius distance / MSE, and in the nature of the updates, which are multiplicative instead of the non-negative least squares (NNLS) used by the current implementation.

So, as far as I know there are multiplicative and gradient updates for both the KL and Frobenius objectives, whereas NNLS is specific to the Frobenius case.

From what I've heard (I will look back to the references when I have time), gradient-based approaches are considered more stable and to converge to better local minima. On the other hand, multiplicative updates have a faster initial convergence speed and are also somewhat simpler to implement because there is no need for step-size adaptation. However, this might change between the KL and Frobenius cases, and the computational cost is also not the same in the dense and sparse cases.

The choice between KL and Frobenius objectives is mainly problem-dependent. I've included an example on the faces dataset but it is not really illustrative of the difference. Furthermore, I noticed that both methods are very sensitive to initialization. Also, the initialization methods are currently the same as in the previous implementation, but they might not be as relevant for the KL case as they were for the Frobenius case. Perhaps I can also look for a better illustrative example.
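For reference, a minimal dense NumPy sketch of the Lee and Seung multiplicative updates for the KL objective discussed above; the function name, defaults and the eps smoothing are illustrative, and this is not the PR's sparse-optimized code:

import numpy as np

def kl_nmf_multiplicative(X, n_components, n_iter=200, eps=1e-8, random_state=0):
    # Illustrative Lee & Seung multiplicative updates for KL(X || WH).
    rng = np.random.RandomState(random_state)
    n_samples, n_features = X.shape
    W = rng.rand(n_samples, n_components)
    H = rng.rand(n_components, n_features)
    for _ in range(n_iter):
        # H <- H * (W^T (X / WH)) / (column sums of W)
        WH = W @ H + eps
        H *= (W.T @ (X / WH)) / (W.sum(axis=0)[:, np.newaxis] + eps)
        # W <- W * ((X / WH) H^T) / (row sums of H)
        WH = W @ H + eps
        W *= ((X / WH) @ H.T) / (H.sum(axis=1)[np.newaxis, :] + eps)
    return W, H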

Best,

Olivier

@GaelVaroquaux (Member)

The choice between KL and Frobenius objectives is mainly problem dependent.

It is important to understand the tradeoffs and to explain them, with
examples, in the documentation.

Perhaps I can also look for a better illustrative example.

That would be great! Thanks a lot

@mblondel (Member)

As far as I know the Lee and Seung paper doesn't have convergence guarantees. Both because of that and for integration with existing code, I think it would be preferable to implement the KL-divergence using the projected-gradient method. This way you just need to replace the gradient computations in the algorithm that we already have (we could add a loss parameter to the constructor).

The coordinate descent algorithm I mentioned in #896 is also pretty nice. It's basically like projected gradient except that it optimizes only one variable at a time. It selects variables greedily in the squared loss case and cyclically in the KL-divergence case.
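To make the "just replace the gradient" idea concrete, here is a sketch of the gradient with respect to H for the two objectives; the helper name and loss strings are hypothetical, not existing scikit-learn API:

import numpy as np

def grad_H(X, W, H, loss="squared", eps=1e-8):
    # Gradient of the NMF objective w.r.t. H, for a projected-gradient solver.
    WH = W @ H
    if loss == "squared":
        # d/dH of 0.5 * ||X - WH||_F^2
        return W.T @ (WH - X)
    if loss == "kullback_leibler":
        # d/dH of sum(X * log(X / WH) - X + WH)
        return W.T @ (1.0 - X / (WH + eps))
    raise ValueError("unknown loss: %r" % loss)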

@agramfort (Member)

One benefit of multiplicative updates is that they apply out of the box to complex-valued data such as that used for audio and time-series processing. It might be interesting to bench multiplicative updates against gradient / coordinate descent. Maybe there are some regimes for which multiplicative updates are faster.

@ogrisel (Member) commented Nov 11, 2012

IIRC @vene already benched a proto of the Lee and Seung method in a gist before deciding to go for the current implementation instead. @vene do you remember what your findings / benchmark results were at the time?

@vene (Member) commented Nov 11, 2012

From what I recall there was no clean-cut advantage, and we were thinking of implementing multiplicative updates too, at some point (note that the NMF class is just a wrapper around ProjectedGradientNMF because we thought ahead).

I would suggest renaming the class in this PR to MultiplicativeNMF and having a parameter loss="frobenius"|"KL".

Thanks for pinging me @ogrisel, I completely missed this thread.

@mblondel (Member)

+1 to @vene's suggestion but loss="squared" might be better to be consistent with SGDRegressor.

@ogrisel (Member) commented Nov 11, 2012

Please: loss='kullback_leibler' instead of loss='KL': acronyms are often not googlable. Remember the time in your youth when you did not know the meaning of "KL".

@vene (Member) commented Nov 11, 2012

Please: loss='kullback_leibler' instead of loss='KL': acronyms are often not googlable.
+100, sorry for the poor suggestion. It's easy to lose track of what's clear.
in your youth
heh

@omangin (Author) commented Nov 14, 2012

From what I recall there was no clean-cut advantage, and we were thinking of implementing multiplicative updates too, at some point (note that the NMF class is just a wrapper around ProjectedGradientNMF because we thought ahead).

I had the same impression from the few comparisons I did in the past.

The coordinate descent algorithm I mentioned in #896 is also pretty nice. It's basically like projected gradient except that it optimizes only one variable at a time. It selects variables greedily in the squared loss case and cyclically in the KL-divergence case.

Seems interesting, I'll have a deeper look at it.

I would suggest renaming the class in this PR to MultiplicativeNMF and having a parameter loss="frobenius"|"KL".

I thought about it. That would actually be a quite straightforward modification. The only issue I see is that the updates in the KL case can be optimized when the data matrix is sparse. It is not guaranteed that the same optimization is possible for the Frobenius case. Anyway, it is probably not an issue if the doc clearly explains that changing the 'loss' parameter can significantly change the complexity of the underlying algorithm.

About the convergence guarantees, as far as I know there is no definitive answer for the multiplicative updates. Some aspects are studied in http://www.csie.ntu.edu.tw/~cjlin/papers/pgradnmf.pdf for the squared loss.

It is probably not a lot of work to include both update methods for both losses. However, the purpose of scikit-learn may not be to include every possible method...

Sorry for the superficial answers. I don't have a lot of time currently for deeper investigations. However, all these questions are very relevant for me, so I will definitely (re-)check my bibliography and bench a few things later.

@amueller (Member)

Thanks for taking the comments into account @omangin. There is really no hurry.
Having a more detailed analysis before moving this forward would be great :)

@ogrisel (Member) commented Nov 14, 2012

We can also raise a ValueError when a method / loss combo is not available / invalid. This is already the case for LinearSVC, which has both a primal and a dual optimizer that are not usable with all the possible loss functions.
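A small sketch of that kind of check, with purely hypothetical method / loss names:

# Hypothetical combinations, for illustration only
_SUPPORTED = {
    ("projected_gradient", "squared"),
    ("projected_gradient", "kullback_leibler"),
    ("multiplicative", "kullback_leibler"),
}

def _check_method_loss(method, loss):
    if (method, loss) not in _SUPPORTED:
        raise ValueError("loss=%r is not available with method=%r"
                         % (loss, method))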

@mblondel (Member)

It is probably not a lot of work to include both update methods for both losses. However, the purpose of scikit-learn may not be to include every possible method...

In that case I would support only projected gradient. As I said earlier, adding KL-divergence is just a matter of adding the gradient. All the rest should remain the same.

@ogrisel (Member) commented Nov 14, 2012

Would @omangin's sparse data optimization for the KL loss be transferable to projected gradients?

@mblondel (Member)

I think @omangin was referring to the fact that it might not be possible to handle the Frobenius norm efficiently in the case of multiplicative updates. I can't comment as I'm only familiar with projected gradient.

I was talking about adding KL-divergence to the projected gradient implementation we already have. I see no reason it cannot support sparse data efficiently.

@omangin (Author) commented Nov 14, 2012

Would @omangin's sparse data optimization for the KL loss be transferable to projected gradients?

The answer is yes.

The optimization is actually just based on the fact that the multiplicative update for KL uses the element-wise division (X / WH), X being the data matrix and WH the learnt factorization.

In the case where X is big and sparse and W and H are not, it takes a long time to compute (and is sometimes impossible to store in memory) the product WH, which is dense by default. Since the algorithm only needs the element-wise division, it can be computed only where the coefficients of X are nonzero.

Both the multiplicative and the gradient updates only use the value of X / WH, so this optimization holds for both algorithms.
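A minimal sketch of that sparse trick, assuming a SciPy sparse X; the names are illustrative and this is not the PR's exact implementation:

import numpy as np
import scipy.sparse as sp

def elementwise_ratio_sparse(X, W, H, eps=1e-8):
    # Compute X / (W @ H) only at the nonzero entries of sparse X,
    # without ever forming the dense product W @ H.
    X = X.tocoo()
    # (W @ H)[i, j] = W[i, :] . H[:, j], evaluated entry by entry
    wh_at_nonzeros = np.einsum('ik,ki->i', W[X.row], H[:, X.col])
    ratio = X.data / (wh_at_nonzeros + eps)
    return sp.coo_matrix((ratio, (X.row, X.col)), shape=X.shape)

The returned sparse matrix can then be fed to either the multiplicative or the projected-gradient update.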

@ogrisel (Member) commented Nov 14, 2012

Both the multiplicative and the gradient updates only use the value of X / WH, so this optimization holds for both algorithms.

Ok great, thanks for the clarification. I have no opinion on whether or not we should include the multiplicative updates optimizer (we probably would require at least one example where it fares better in some way than the existing projected gradient method) but we should definitely add support for loss=kullback_leibler with the sparsity optim.

@amueller (Member)

I agree with @ogrisel :)

@mblondel (Member) commented Jan 8, 2014

@omangin Could you create a gist with your code and add it to https://github.com/scikit-learn/scikit-learn/wiki/Third-party-projects-and-code-snippets?

@omangin (Author) commented Feb 4, 2014

As suggested by @mblondel, I added the code as a third-party code snippet. See the third-party projects page and the gist.

@mblondel (Member) commented Feb 4, 2014

Thank you very much @omangin! I tweeted your gist https://twitter.com/mblondel_ml/status/430686434526130176

@sreenivasraghavan71

Hi, I am new to contributing. I am interested in this topic, NMF. Can someone suggest a good NMF algorithm that should be included? Please give me a topic so that I can start a bit early.

@amueller (Member)

There is a comprehensive discussion here #4811 and here #4852. You could ask @TomDLT if he wants any help with #4852.

@amueller (Member)

closing in favor of #5295

@amueller closed this Oct 25, 2016