NMF for Kullback-Leibler divergence #1348
Conversation
Now computes the real generalized_KL even for sparse matrices. Also fixes and extends tests.
- Fixes NMF random test that was not reliable enough.
- Adds flag to force parameter scaling in update when not in fit mode.
- Cosmetics.
- Better import in KL NMF test.
- Adds random state to NMF init params.
Hi @omangin. So basically this one minimizes the KL divergence while the current implementation minimizes MSE? Is there a qualitative difference between the results of this algorithm and the current implementation? Best,
Btw, if possible it might be nice to have an example to illustrate the difference between the two NMF methods.
Hi, thanks for your comments. This is the original implementation from Lee and Seung's Nature paper. It differs from the current implementation because the objective it optimizes is based on the Kullback-Leibler divergence rather than the Frobenius distance / MSE. It also differs in the nature of its updates: it uses multiplicative updates instead of the non-negative least squares (NNLS) used by the current implementation.

As far as I know, there are multiplicative and gradient updates for both the KL and Frobenius objectives, whereas NNLS is specific to the Frobenius case. From what I've heard (I will look back at the references when I have time), gradient-based approaches are considered more stable and tend to converge to better local minima. On the other hand, multiplicative updates have faster initial convergence and are somewhat simpler to implement because there is no need for step-size adaptation. However, this might change between the KL and Frobenius cases, and the computational cost also differs between the dense and sparse cases. The choice between the KL and Frobenius objectives is mainly problem dependent.

I've included an example on the faces dataset, but it is not really illustrative of the difference. Furthermore, I noticed that both methods are very sensitive to initialization. Also, the initialization methods are currently the same as in the previous implementation, but they might not be as relevant for the KL case as they were for the Frobenius case. Perhaps I can also look for a better illustrative example.

Best, Olivier
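For context, the multiplicative update rules from Lee and Seung's paper for the generalized KL divergence can be sketched in a few lines of NumPy. This is a minimal dense-data illustration of the rules being discussed, not the code of this PR (the function name and the `eps` smoothing are mine):

```python
import numpy as np

def kl_nmf_multiplicative(X, n_components, n_iter=200, eps=1e-10, random_state=0):
    # Minimal dense-data sketch of Lee & Seung's multiplicative updates
    # for the generalized KL divergence D(X || WH); not the code of this PR.
    rng = np.random.RandomState(random_state)
    n_samples, n_features = X.shape
    W = rng.rand(n_samples, n_components)
    H = rng.rand(n_components, n_features)
    for _ in range(n_iter):
        WH = W @ H + eps
        # H <- H * (W^T (X / WH)) / (W^T 1)
        H *= (W.T @ (X / WH)) / (W.sum(axis=0)[:, np.newaxis] + eps)
        WH = W @ H + eps
        # W <- W * ((X / WH) H^T) / (1 H^T)
        W *= ((X / WH) @ H.T) / (H.sum(axis=1)[np.newaxis, :] + eps)
    return W, H
```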
It is important to understand the tradeoffs and to explain them, with
That would be great! Thanks a lot.
As far as I know, the Lee and Seung paper doesn't have convergence guarantees. Both because of that and for integration with the existing code, I think it would be preferable to implement the KL divergence using the projected gradient method. This way you just need to replace the gradient computations in the algorithm that we already have (we could add a `loss` parameter).

The coordinate descent algorithm I mentioned in #896 is also pretty nice. It's basically like projected gradient except that it optimizes only one variable at a time. It selects variables greedily in the squared-loss case and cyclically in the KL-divergence case.
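To illustrate the "just replace the gradient" point, here is a rough sketch with a hypothetical `loss` parameter (the parameter name, values, and helper are illustrative, not settled in this thread):

```python
import numpy as np

def grad_H(X, W, H, loss="frobenius", eps=1e-10):
    # Sketch: only the gradient w.r.t. H changes between objectives,
    # the projected gradient machinery itself stays the same.
    WH = W @ H
    if loss == "frobenius":
        return W.T @ (WH - X)                # gradient of 0.5 * ||X - WH||_F^2
    if loss == "kullback-leibler":
        return W.T @ (1.0 - X / (WH + eps))  # gradient of D_KL(X || WH)
    raise ValueError("unknown loss: %r" % loss)
```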
One benefit of multiplicative updates is that they apply out of the box to complex-valued data, such as is used for audio and time series processing. It might be interesting to benchmark multiplicative updates against gradient / coordinate descent. Maybe there are some regimes in which multiplicative updates are faster.
From what I recall there was no clean-cut advantage, and we were thinking of implementing multiplicative updates too, at some point (note that the NMF class is just a wrapper around ProjectedGradientNMF because we thought ahead). I would suggest renaming the class in this PR to

Thanks for pinging me @ogrisel, I completely missed this thread.
+1 to @vene's suggestion, but
Please:
I had the same impression from the few comparisons I did in the past.
Seems interesting, I'll have a deeper look at it.
I thought about it. That would actually be a quite straightforward modification. The only issue I see is that the updates in the KL case can be optimized when the data matrix is sparse; it is not guaranteed that the same optimization is possible for the Frobenius case. Anyway, it is probably not an issue if the doc clearly explains that changing the 'loss' parameter can change the complexity of the underlying algorithm a lot.

About the convergence guarantees, as far as I know there is no definitive answer for the multiplicative updates. Some aspects are studied in http://www.csie.ntu.edu.tw/~cjlin/papers/pgradnmf.pdf for the square loss.

It is probably not a lot of work to include both update methods for both losses. However, the purpose of scikit-learn may not be to include every possible method...

Sorry for the superficial answers. I don't have a lot of time currently for deeper investigations. However, all these questions are very relevant for me, so I will definitely (re-)check my bibliography and bench a few things later.
Thanks for taking the comments into account @omangin. There is really no hurry.
We can also raise a
In that case I would support only projected gradient. As I said earlier, adding the KL divergence is just a matter of adding the gradient. All the rest should remain the same.
Would @omangin's sparse data optimization for the KL loss be transferable to projected gradient?
I think @omangin was referring to the fact that it might not be possible to handle the Frobenius norm efficiently in the case of multiplicative updates. I can't comment, as I'm only familiar with projected gradient. I was talking about adding the KL divergence to the projected gradient implementation we already have. I see no reason it cannot support sparse data efficiently.
The answer is yes. The optimization is actually just grounded on the fact that the multiplicative update for KL uses the element-wise division X / WH, X being the data matrix and WH the learnt factorization. In the case where X is sparse and big while W and H are not, it is very slow to compute (and sometimes not even possible to store in memory) the product WH, which by default is dense. Since only the element-wise division is needed by the algorithm, it can be computed only where the coefficients of X are nonzero. Both the multiplicative and the gradient updates only use the value of X / WH, so this optimization holds for both algorithms.
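A minimal sketch of that sparse trick with SciPy (the helper name is illustrative, not taken from the PR or the gist): the ratio X / (WH) is evaluated only at the nonzero coordinates of X, so the dense product WH is never materialized.

```python
import numpy as np
import scipy.sparse as sp

def sparse_elementwise_ratio(X, W, H, eps=1e-10):
    # Compute the element-wise ratio X / (WH) only at the nonzero
    # coordinates of X, without ever forming the dense product WH.
    X = sp.coo_matrix(X)
    # (WH)[i, j] = W[i, :] @ H[:, j], evaluated only where X[i, j] != 0
    wh_at_nonzeros = np.einsum("nk,kn->n", W[X.row], H[:, X.col])
    ratio = X.data / (wh_at_nonzeros + eps)
    return sp.coo_matrix((ratio, (X.row, X.col)), shape=X.shape)
```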
Ok great, thanks for the clarification. I have no opinion on whether or not we should include the multiplicative updates optimizer (we would probably require at least one example where it fares better in some way than the existing projected gradient method), but we should definitely add support for
I agree with @ogrisel :)
@omangin Could you create a gist with your code and add it to https://github.com/scikit-learn/scikit-learn/wiki/Third-party-projects-and-code-snippets?
As suggested by @mblondel, I added the code as a third-party code snippet. See the third-party projects page and the gist.
Thank you very much @omangin! I tweeted your gist https://twitter.com/mblondel_ml/status/430686434526130176
Hi, I am new to contributing and I am interested in NMF. Can someone point me to a good NMF algorithm that should be included? Please suggest a topic so that I can start early.
closing in favor of #5295 |
Adds the KLdivNMF class to decomposition.
The implemented method uses multiplicative updates. It is optimized to work with sparse matrices with lots of features.
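A hedged usage sketch of the class described above, assuming it follows the standard scikit-learn estimator API (`fit_transform` / `components_`); the exact signature is an assumption on my part, and the class lives on this PR branch / in the linked gist, not in released scikit-learn:

```python
import numpy as np
from sklearn.decomposition import KLdivNMF  # available on this branch only (assumed import path)

X = np.abs(np.random.RandomState(0).randn(100, 50))  # nonnegative toy data
model = KLdivNMF(n_components=10, random_state=0)    # random_state per the PR's init params
W = model.fit_transform(X)   # per-sample activations (assumed sklearn-style API)
H = model.components_        # learnt dictionary (assumed sklearn-style API)
```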