Discussion about adding NMF methods

NMF in scikit-learn overview :

Current implementation (code):
- loss = squared (aka Frobenius norm)
- method = projected gradient
- regularization = trick to enforce sparseness or low error with beta / eta

#1348 (or in gist):
- loss = (generalized) Kullback-Leibler divergence (aka I-divergence)
- method = multiplicative update
- regularization = None

#2540 [WIP]:
- loss = squared (aka Frobenius norm)
- method = multiplicative update
- regularization = None

#896 (or in gist):
- loss = squared (aka Frobenius norm)
- method = coordinate descent, no greedy selection
- regularization = L1 or L2

Papers describing the methods:
- Multiplicative update
- Projected gradient
- Coordinate descent with greedy selection

About the uniqueness of the results
The problem is non-convex, and there is no unique minimum:
Different losses, different initializations, and/or different optimization methods generally give different results !

About the methods

The multiplicative update (MU) is the most widely used because of it's simplicity. It is very easy to adapt it to squared loss, (generalized) Kullback-Leibler divergence or Itakura–Saito divergence, which are 3 specific cases of the so-called beta-divergence. All three losses seem used in practice. A regularization L1 or L2 can easily be added.
The Projected gradient (PG) seems very efficient for the squared loss, but does not scale well (w.r.t X size) for the (generalized) KL divergence. A L1 or L2 regularization could possibly be added in the gradient step. I don't know where the sparseness enforcement trick in current code comes from.
The Coordinate Descent (CD) seems even more efficient for squared loss, and we can add easily L1 or L2 regularization. It can be further speeded up by a greedy selection of coordinate. The adaptation for KL divergence is possible with a Newton method for solving subproblem (slower), but without greedy selection. This adaptation is supposed to be faster than MU-NMF with (generalized) KL divergence.

About the initialization
Different schemes exist, and can change significantly both result and speed. They can be used independantly for each NMF method.

About the stopping condition
Actual stopping condition in PG-NMF is bugged (#2557), and leads to poor minima when the tolerance is not low enough, especially in the random initialization scheme. It is also completely different from stopping condition in MU-NMF, which is very difficult to set. Talking with audio scientists (who use a lot MU-NMF for source seperation) reveals that they just set a number of iteration.

As far as I understand NMF, as there is no unique minimum, there is no perfect loss/method/initialization/regularization. A good choice for some dataset can be terrible for another one. I don't know how many methods we want to maintain in scikit-learn, and how much we want to guide users with few possibilities, but several methods seems more useful than only one.

I have tested MU-NMF, PG-NMF and CD-NMF from scikit-learn code, #2540 and #896, with squared loss and no regularization, on a subsample of 20news dataset, and performances are already very different depending on the initialization (see below).

Which methods do we want in scikit-learn?
Why do we have stopped #1348 or #896 ?
Do we want to continue #2540 ?
I can work on it as soon as we have decided.

NNDSVD (similar curves than NNDSVRAR)

NNDSVDA

Random run 1

Random run 2

Random run 3

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions