Discussion about adding NMF methods #4811
Thank you for this great summary. Your evaluation doesn't include greedy selection, right? @mblondel any reason you didn't implement that? What are other state-of-the-art implementations? Also, I'd like to see how the methods perform on different datasets. I think sparse data is more common with NMF, but it would be good to have some results on dense data, too. Maybe MNIST, maybe audio? I'm not sure what good datasets are. There are two face datasets that are used in the papers. Can we use them? Or maybe lfw or olivetti?
Scattered thoughts:
- We could try to pick up #2557 and replace the projected gradient solver with the one from this notebook that solves the regularization/sparseness awkwardness, to give the PG-NMF method a fair trial, and see if CD consistently outperforms it (for tall & wide, dense & sparse data).
- Could you explain the bit about projected gradient scaling worse when KL divergence loss is used?
- I'm a bit worried that the different inits lead to such different results. I wonder if the solutions are all equally "good". Maybe we could stick a classifier after it?
- I agree with @amueller that we should bench this on text (20newsgroups) and faces or MNIST.
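For what it's worth, a minimal sketch of the kind of text benchmark being suggested, assuming the standard 20newsgroups loader and whichever NMF solver the installed scikit-learn version provides (all parameter choices below are illustrative):

```python
import time
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Sparse text data: vectorize a subset of 20newsgroups, fit NMF, and report
# wall-clock time plus the final reconstruction error.
docs = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data[:2000]
X = TfidfVectorizer(max_features=5000, stop_words="english").fit_transform(docs)

t0 = time.time()
model = NMF(n_components=10, init="nndsvd", max_iter=200, random_state=0)
W = model.fit_transform(X)
print("time: %.1f s, reconstruction error: %.3f"
      % (time.time() - t0, model.reconstruction_err_))
```

The same harness could be run on dense data (MNIST, faces) by swapping the loader and vectorizer.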
Ok, I am working on more benchmarking, including dense and large datasets.
I read it in this paper, but it is not very detailed. Projected gradient's original paper does not deal with KL divergence since it is not well defined at 0.
It actually seems that the Yang et al. paper says that projected gradient has problems with I-divergence (generalized KL?) but not with normalized KL, and (quoting from page 7): …
Is I-divergence really that slow to converge with gradient-based methods? Maybe we should try. @mblondel do you have any intuition?
This might also be relevant: http://www.jmlr.org/proceedings/papers/v28/kumar13b.pdf. They solve it exactly under certain conditions. Also, using ADMM should work fairly well and is easy to implement. It should be much faster than projected gradient descent. (via @bmcfee)
> Also, using ADMM should work fairly well and is easy to implement. It should be much faster than projected gradient descent. via @bmcfee

In my lab, we have had very bad experiences with ADMM: its convergence is sensitive to the rho parameter, and sometimes it does not converge (the acceptable values of rho typically depend on the input data).
You need to dynamically scale the rho. Maybe you just need an ADMM expert ;)
I have no idea, I'm just channeling @bmcfee and adding snarkiness btw.
That didn't work for us. On two different problems (covariance estimation and …), we almost got it to work: it worked 99% of the time. But the remaining 1% …
Well, tell @bmcfee that the experts hide their knowledge well enough to make sure that no one benefits from it :). But rumors do say that setting rho in an ADMM can be a problem.
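For reference, "dynamically scaling rho" usually refers to the residual-balancing heuristic from Boyd et al.'s ADMM monograph (section 3.4.1); a minimal sketch, independent of any particular NMF formulation:

```python
def update_rho(rho, primal_res, dual_res, mu=10.0, tau=2.0):
    """Residual balancing: grow rho when the primal residual dominates,
    shrink it when the dual residual dominates, otherwise keep it."""
    if primal_res > mu * dual_res:
        return rho * tau
    if dual_res > mu * primal_res:
        return rho / tau
    return rho
```

Even with this heuristic, as noted above, convergence can still depend on the data, so it is no silver bullet.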
Did you use my implementation for CD? Ideally it needs to be Cythonified to obtain the best performance; currently it uses Numba.

Did you use the same initialization for each method? I am surprised that CD seems to get stuck in a worse local optimum than the other methods, or maybe this is just the scale of your plot. Could you share your script? Are the regularization parameters the same? Our projected gradient implementation has a different interface. BTW, we also need to decide whether we want to simplify the interface (like in my implementation).

Use the entire news20 dataset. As the dataset grows, I would expect CD to win.

ADMM is nice for complex regularizers. Non-negativity constraints are pretty easy to handle without resorting to ADMM.
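On the "same initialization" point, one way to make the comparison fair is to hand every solver the exact same W and H; a sketch assuming a scikit-learn version whose NMF accepts init="custom" and a solver parameter (both are assumptions about the installed version, not about the implementations compared in this thread):

```python
import numpy as np
from sklearn.decomposition import NMF

# Same starting point for every solver, so differences come from the
# optimization method rather than from the initialization.
rng = np.random.RandomState(0)
X = np.abs(rng.randn(200, 100))          # toy non-negative data
k = 10
W0 = np.abs(rng.randn(200, k))
H0 = np.abs(rng.randn(k, 100))

for solver in ("cd", "mu"):              # available solvers depend on the version
    model = NMF(n_components=k, solver=solver, init="custom", max_iter=300)
    model.fit_transform(X, W=W0.copy(), H=H0.copy())
    print(solver, model.reconstruction_err_)
```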
My colleague recommends you have a look at http://arxiv.org/pdf/1401.5226v2.pdf; it contains benchmarks of NMF solvers on various problems.
More benchmarking and more details:
I will change the regularization of PG-NMF, using this notebook, and do more benchmarking with regularization.
It's nice to see that Numba is on par with Cython. The Cython version will be useful in any case since we cannot depend on Numba yet.
Yeah, that looks pretty good.
Any idea why there's no more obvious difference in the loss attained by PG and CD? Based on these results I would just keep the CD solver. I would really like this to be the case; it would make everything so much simpler. Adding regularization might give us more insight. Another dimension not varied across these plots is …

API questions about regularization: …

EDIT: or, easier to use and grid-search but less expressive: …
With KL we can reach audio people.
I wouldn't mind KL (I-divergence?) for text either, since it corresponds to pLSI. But Frobenius seems to lead to good topics too.
There is at least one state-of-the-art NMF algorithm that hasn't been discussed yet: Kim and Park's Block Principal Pivoting method: https://3e4b2411-a-62cb3a1a-s-sites.googlegroups.com/site/jingukim/2011_paper_sisc_nmf.pdf

This is a block coordinate descent approach for solving the alternating nonnegative least squares (ANLS) subproblems under the Frobenius norm. It basically speeds up the old active-set methods. It is straightforward to implement, and in their paper Kim and Park demonstrate, on a variety of datasets (faces and text), that it performs better, both in convergence time and relative reconstruction error, than the MU and PG algorithms already mentioned in this discussion, and about the same as the CD method already mentioned. Interestingly, when used to solve the ANLS subproblems in nonnegative tensor factorization, this algorithm performs better than the CD method (see https://3e4b2411-a-62cb3a1a-s-sites.googlegroups.com/site/jingukim/2011_paper_hpscbook_ntf.pdf).

Python code for this algorithm already exists, although not in the scikit-learn API style: https://github.com/kimjingu/nonnegfac-python. Definitely worth considering.
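For readers unfamiliar with the ANLS structure that BPP accelerates, here is a deliberately naive sketch built on SciPy's nnls: it only shows the alternating nonnegative least-squares subproblems (one solve per column of H, one per row of W) and is far slower than BPP, active-set, or CD, which all solve the same subproblems more cleverly:

```python
import numpy as np
from scipy.optimize import nnls

def anls_nmf(X, n_components, n_outer=20, random_state=0):
    """Alternating nonnegative least squares: fix W and solve for H column
    by column, then fix H and solve for W row by row."""
    rng = np.random.RandomState(random_state)
    n_samples, n_features = X.shape
    W = np.abs(rng.randn(n_samples, n_components))
    H = np.zeros((n_components, n_features))
    for _ in range(n_outer):
        for j in range(n_features):       # min_h ||W h - X[:, j]||_2, h >= 0
            H[:, j], _ = nnls(W, X[:, j])
        for i in range(n_samples):        # min_w ||H.T w - X[i, :]||_2, w >= 0
            W[i, :], _ = nnls(H.T, X[i, :])
    return W, H
```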
Interesting. As scikit-learn doesn't deal with tensors, that sounds like further support for the CD solver :)
Seems to me it argues for BPP, since the two perform similarly on NMF, and when scikit-learn adds tensor support, the state-of-the-art ANLS solver for NTF will already be implemented. However, to be fair, it looks to me like the comparison done in the paper is to an unaccelerated version of the proposed CD solver (see http://download.springer.com/static/pdf/974/art%253A10.1007%252Fs10898-013-0035-4.pdf). So for a proper comparison it would need to be added to @TomDLT's tests.
There is currently no plan for that to ever happen. That's what I meant: no reason to go for the fancier, newer method if we'll never even need its benefits.
This is also my feeling. Having too many solvers makes things harder in case of refactoring. For instance, adding proper fit_intercept support in Ridge is harder because we have too many solvers to change. The CD code is also pretty simple and has a good stopping criterion.
We definitely do, but let's do it one PR at a time. For now let's focus on the squared loss and the API :)
Maybe elastic net style for consistency with other modules? However, even L2-only regularization will lead to sparse solutions in practice. Regarding factor-specific regularization, this sounds complicated. Maybe we can keep the class API simple and expose a function with more options?
Could you quote the relevant sentence? One possible reason would be that the I-divergence doesn't have a Lipschitz-continuous first derivative (i.e. its second derivative is not bounded above). Most gradient-based solvers assume a Lipschitz-continuous gradient in their convergence proof. However, for CD, the authors were still able to prove convergence to a stationary point: …
From here. But the results in section 4 don't seem to me to really support the very strong language used above.
So am I reading this right: what the KDD'11 paper calls KL divergence (equation 3) is actually the generalized KL divergence, or I-divergence, right? Side note: both of those are horrible names. What does the I in I-divergence stand for? This claims (paraphrasing) that "it's generalized in the sense that KL-Div(a, b) = I-Div(a, b) if a and b are probability distributions." I-Div is also a Bregman divergence.
Yes.
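To pin down the terminology (these are just the standard definitions, not code from any of the PRs):

```python
import numpy as np

def i_divergence(a, b):
    """Generalized KL / I-divergence: sum(a * log(a / b) - a + b).
    When a and b are both normalized to sum to 1, the '- a + b' terms
    cancel and this reduces to the ordinary KL divergence."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sum(a * np.log(a / b) - a + b)

a = np.array([0.2, 0.3, 0.5])
b = np.array([0.1, 0.4, 0.5])
kl = np.sum(a * np.log(a / b))
print(np.isclose(i_divergence(a, b), kl))   # True for normalized inputs
```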
After talking with @agramfort and various audio colleagues, it appears that: …

I suggest implementing both the MM method (in order to easily support different losses) and the CD method (which seems to be a very efficient method for the Frobenius norm, as we have seen in the benchmarks). In terms of API, @vene and @mblondel suggested elastic net style (with …).
- Should we have several classes (MultiplicativeNMF, ProjectedGradientNMF, ...), or one NMF class and several solvers?
- How should we deal with the current solver? With the current regularization trick?
I vote for one NMF class, different solvers.
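A rough sketch of what a single NMF class with a solver parameter could look like; every name below (solver, loss, alpha, l1_ratio, ...) is illustrative rather than a settled API:

```python
from sklearn.base import BaseEstimator, TransformerMixin

class NMF(BaseEstimator, TransformerMixin):
    """Hypothetical single-class API: one estimator, several solvers."""

    def __init__(self, n_components=None, solver="cd", loss="frobenius",
                 init="nndsvd", alpha=0.0, l1_ratio=0.0,
                 tol=1e-4, max_iter=200, random_state=None):
        self.n_components = n_components
        self.solver = solver            # "cd", "mu", "pg", ...
        self.loss = loss                # "frobenius", "kullback-leibler", ...
        self.init = init
        self.alpha = alpha              # overall regularization strength
        self.l1_ratio = l1_ratio        # elastic-net mixing (0 = L2, 1 = L1)
        self.tol = tol
        self.max_iter = max_iter
        self.random_state = random_state

    def fit_transform(self, X, y=None):
        # dispatch to the chosen solver here
        raise NotImplementedError
```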
I think we are avoiding naming classes after solvers. One …

Sorry, I'm not exactly sure what you mean by this. If we implement CD there's no reason to keep PG unless benchmarks back it up in certain settings.
Alternatively we could name the new class … Does the maximization-minimization method support the elastic-net penalty? Eqs. 73-74 give a solution for the L1 penalty. If not, we could restrict the API to penalizing just with L1, and support the fully-expressive non-tied elastic net formulation in a …
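For reference, the elastic-net penalty under discussion is usually composed as below (standard definition; alpha and l1_ratio are hypothetical parameter names):

```python
import numpy as np

def elastic_net_penalty(W, alpha, l1_ratio):
    """alpha * (l1_ratio * ||W||_1 + 0.5 * (1 - l1_ratio) * ||W||_F^2)."""
    return alpha * (l1_ratio * np.abs(W).sum()
                    + 0.5 * (1.0 - l1_ratio) * (W ** 2).sum())
```

With l1_ratio=1 this reduces to the pure L1 case covered by Eqs. 73-74; whether the MM updates extend to the mixed case is exactly the open question above.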
There is also the Diagonalized Newton Algorithm for Nonnegative Matrix Factorization by Hugo Van hamme (http://arxiv.org/abs/1301.3389), mentioned by @vene.
In my comment I claimed that this algorithm was shown to outperform CD. It doesn't look like that is accurate; the comparisons are only against multiplicative updates. No idea why I said that. Maybe I thought CD meant something else? Sorry!
Should we close this, now that #4852 was merged? Or keep it open to discuss adding more losses?
Based on this discussion, I have worked on adding a multiplicative-update solver for NMF (more precisely Maximization-Minimization) in PR #5295.
NMF in scikit-learn overview:
Current implementation (code):
- loss = squared (aka Frobenius norm)
- method = projected gradient
- regularization = trick to enforce sparseness or low error with beta / eta
#1348 (or in gist):
- loss = (generalized) Kullback-Leibler divergence (aka I-divergence)
- method = multiplicative update
- regularization = None
#2540 [WIP]:
- loss = squared (aka Frobenius norm)
- method = multiplicative update
- regularization = None
#896 (or in gist):
- loss = squared (aka Frobenius norm)
- method = coordinate descent, no greedy selection
- regularization = L1 or L2
Papers describing the methods:
- Multiplicative update (a minimal sketch follows this list)
- Projected gradient
- Coordinate descent with greedy selection
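As a concrete reference for the multiplicative-update entry above, the classic Lee & Seung updates for the squared loss can be sketched as follows (a toy implementation, not the code in any of the PRs):

```python
import numpy as np

def nmf_mu(X, n_components, n_iter=200, eps=1e-10, random_state=0):
    """Toy multiplicative-update NMF for the squared (Frobenius) loss."""
    rng = np.random.RandomState(random_state)
    n_samples, n_features = X.shape
    W = rng.rand(n_samples, n_components)
    H = rng.rand(n_components, n_features)
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # H <- H * (W^T X) / (W^T W H)
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # W <- W * (X H^T) / (W H H^T)
    return W, H

# Example on random non-negative data:
rng = np.random.RandomState(0)
X = np.abs(rng.randn(100, 50))
W, H = nmf_mu(X, n_components=5)
print(0.5 * np.linalg.norm(X - W @ H) ** 2)    # squared-loss value
```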
About the uniqueness of the results
The problem is non-convex, and there is no unique minimum:
Different losses, different initializations, and/or different optimization methods generally give different results!
About the methods
About the initialization
Different schemes exist, and they can significantly change both the result and the speed. They can be used independently with each NMF method.
About the stopping condition
The current stopping condition in PG-NMF is bugged (#2557), and leads to poor minima when the tolerance is not low enough, especially with the random initialization scheme. It is also completely different from the stopping condition in MU-NMF, which is very difficult to set. Talking with audio scientists (who use MU-NMF a lot for source separation) reveals that they just set a fixed number of iterations.
As far as I understand NMF, since there is no unique minimum, there is no perfect loss/method/initialization/regularization. A good choice for some dataset can be terrible for another one. I don't know how many methods we want to maintain in scikit-learn, and how much we want to guide users by offering only a few possibilities, but several methods seem more useful than only one.
I have tested MU-NMF, PG-NMF and CD-NMF from the scikit-learn code, #2540 and #896, with squared loss and no regularization, on a subsample of the 20news dataset, and the performances are already very different depending on the initialization (see below).
Which methods do we want in scikit-learn?
Why did we stop #1348 or #896?
Do we want to continue #2540?
I can work on it as soon as we have decided.
Benchmark plots for each initialization scheme:
- NNDSVD (similar curves to NNDSVDAR)
- NNDSVDA
- Random run 1
- Random run 2
- Random run 3