[WIP] REFACTOR: dict_learning and dict_learning_online #9036
@agramfort: could you review this? Thanks.
I'm looking at this; unfortunately I have zero memory of the old code, so I need to re-read it and figure things out. Depending on how simple you think the changeset is, we might want to split it into the projection bugfix and the refactoring.
Data matrix.
code : array of shape (n_components, n_samples)
A : array of shape (n_components, n_components)
Was the old version of this just plain wrong?
Never mind, I see what is happening. This needs some documentation in terms of what problem this function is solving.
No, sorry, I don't understand the math in here. Would you mind explaining the changes a bit? The shapes are not the same as before. If the math does the same thing, it is not obvious to me why.
Could you please be precise about which part of the computation you don't understand? Is it the normalization by A[k, k], etc.?
Also, in a moment I'll push a modified docstring with more detailed comments in it.
I added a more detailed comment below after thinking about it. No, the normalization makes sense to me. It just seems like the old computation does not correspond to what is in the paper.
(to be clear, the old code is what is confusing here, not yours!)
Ya, indeed it looks like the old _dict_update code is incorrect in the online case, due to the extra dictionary[:, k] = np.dot(R, A[k, :].T) on line 357.
does A denote atoms? if so can we just have it as atoms?
dictionary[:, k] /= A[k, k]
_proj_l2(dictionary[:, k])
# R <- -1.0 * U_k * V_k^T + R
R = ger(-1.0, dictionary[:, k], A[k, :], a=R, overwrite_a=True)
This went down one level of indentation, is it correct?
It's intentional. The old indentation appeared incorrect to me: it would lead to incorrect (or at least unjustified) modifications of R whenever a random atom was inserted into the dictionary (i.e. when the old one was close to zero). Agreed?
# R <- -1.0 * U_k * V_k^T + R
R = ger(-1.0, dictionary[:, k], code[k, :], a=R, overwrite_a=True)
dictionary[:, k] /= A[k, k]
_proj_l2(dictionary[:, k])
Your fix of the l2 projection here agrees with eqn 10 in http://www.di.ens.fr/sierra/pdfs/icml09.pdf. Nice catch!
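For reference, the projection in eqn 10 just rescales an atom back onto the l2 unit ball when its norm exceeds one; a minimal sketch of that operation (not necessarily the exact helper in this PR):
import numpy as np

def proj_l2_unit_ball(v, copy=False):
    # eqn 10 of Mairal et al. (2009): v <- v / max(||v||_2, 1);
    # atoms already inside the unit ball are left untouched.
    if copy:
        v = v.copy()
    norm = np.linalg.norm(v)
    if norm > 1.:
        v /= norm
    return v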
R = ger(1.0, dictionary[:, k], code[k, :], a=R, overwrite_a=True)
dictionary[:, k] = np.dot(R, code[k, :].T)
R = ger(1.0, dictionary[:, k], A[k, :], a=R, overwrite_a=True)
dictionary[:, k] = np.dot(R, A[k, :])
I tried working through this on paper and I think the batch update ends up different by a factor of CCt.
The old implementation seems incorrect to me; your implementation corresponds to the online update in the paper, applied over the entire dataset.
Previously we had:
R = X - DCt # dimensions here are transposed compared to sklearn conventions
D[:, k] = R * C[k] = (X - DCt) C[k]
Now, by forming A and B, we have
R = B - DA = XC - DCtC = (X - DCt)C
D[:, k] = R Ak = R CtC[k] = (X - DCt) CCt C[k]
Am I missing something or was this indeed a bug previously?
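One way to sanity-check that claim numerically, with random data and local shapes that roughly mirror the notation above (just an illustration, not sklearn code):
import numpy as np

rng = np.random.RandomState(0)
n_features, n_samples, n_components = 8, 20, 3
X = rng.randn(n_features, n_samples)   # transposed layout, as in the comment above
D = rng.randn(n_features, n_components)
C = rng.randn(n_components, n_samples)
k = 0

# old-style batch update direction: R C[k]^T with R = X - DC
old = (X - D.dot(C)).dot(C[k])

# update via the sufficient statistics A = CC^T, B = XC^T
A = C.dot(C.T)
B = X.dot(C.T)
new = (B - D.dot(A)).dot(A[k])          # = (X - DC) C^T (CC^T)[:, k]

# the two directions generically differ by the extra factor CC^T
print(np.allclose(old / np.linalg.norm(old), new / np.linalg.norm(new)))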
I did a quick check and indeed I seem to get slightly different dictionary atoms before and after this PR with the batch version. They seem equal at the first decimal but differ at the second or the third, so they're close.
Your current implementation seems to match the paper correctly, at least in the online case. Unfortunately, the dict_learning_online function doesn't print the objective value at each iteration, so I can't compare that easily, but for the old implementation the objective does seem to decrease at least.
Even with max_iter=10000 and tol=1e-20 a difference persists. I computed the objective by hand with:
import numpy as np
from sklearn.decomposition.dict_learning import dict_learning

rng = np.random.RandomState(0)
X = rng.randn(30, 20)
U, V, _ = dict_learning(X, n_components=3, alpha=0.1, random_state=0,
                        verbose=2, max_iter=10000, tol=1e-20)
print(0.5 * np.sum((X - np.dot(U, V)) ** 2) + 0.1 * np.sum(np.abs(U)))
On master I get 188.766637, and on the current branch I get 188.785850, so slightly more... Also it seems quite a bit faster on master.
Ah, looks like I actually introduced a bug: the line dictionary[:, k] = np.dot(R, A[k, :]) shouldn't be there anymore. This additional buggy computation should account for some of the speed loss you're experiencing. Thanks for the catch. Fixed.
BTW, this is what I'm doing:
Minimize .5 * ||Xt - DC||_F^2 = .5 * tr(DtDA) - tr(DtB) + .5 * tr(XXt),
as a function of the dictionary (V), where
B = XtCt, A = CCt, R = B - DA.
Note that the update of the jth atom is given by (see eqn 10 of ref paper below):
BCD for kth atom:
R = R + D[:, k]A[k, :] # rank-1 update
D[:, k] = R[:, k] / A[k, k] # = D[:, k] + R[:, k] / A[k, k]
D[:, k] = proj(D[:, k]) # eqn 10
R = R - D[:, k]A[k, :] # rank-1 update
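As a runnable illustration of that update (a small numpy sketch of the block coordinate descent step, not the PR code itself):
import numpy as np

def bcd_atom_update(D, A, B, k):
    # One BCD pass over atom k, given the statistics A = CCt and B = XtCt.
    R = B - D.dot(A)                  # residual in the statistics space
    R += np.outer(D[:, k], A[k])      # rank-1 update: add back atom k's contribution
    D[:, k] = R[:, k] / A[k, k]
    norm = np.linalg.norm(D[:, k])
    if norm > 1.:                     # eqn 10: project onto the l2 unit ball
        D[:, k] /= norm
    R -= np.outer(D[:, k], A[k])      # rank-1 update with the refreshed atom
    return D, R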
I just checked with your code (modified):
import time
import numpy as np
from sklearn.decomposition.dict_learning import dict_learning
rng = np.random.RandomState(0)
X = rng.randn(30, 20)
t0 = time.time()
U, V, _ = dict_learning(X, n_components=3, alpha=0.1, random_state=0,
verbose=2, max_iter=10000, tol=1e-20)
print(0.5 * np.sum((X - np.dot(U, V)) ** 2) + 0.1 * np.sum(np.abs(U)))
print("Time elapsed: %.2f" % (time.time() - t0))
On master:
Iteration 647 (elapsed time: 16s, 0.3mn, current cost 198.777)
Iteration 648 (elapsed time: 16s, 0.3mn, current cost 198.777)
--- Convergence reached after 648 iterations
198.776663706
Time elapsed: 16.46
On PR:
--- Convergence reached after 647 iterations
Learning code... done (total time: 10s, 0.2mn)
198.776663706
Time elapsed: 10.38
Thus the results (trajectory of the cost function, convergence point, etc.) are exactly the same, but my PR is faster than the master branch. Can you confirm? Thanks.
Confirmed: your PR is faster and reaches the same error.
With tol=1e-20 the number of iterations needed for convergence fluctuates (both on master and here), but I think (although odd) it's a numerical issue. With a reasonable tol of 1e-6 the number of iterations is exactly the same. The returned U and V have the same norm as well, so I think you're right in saying the paths are identical.
master
[...]
Iteration 166 (elapsed time: 3s, 0.1mn, current cost 198.781)
--- Convergence reached after 166 iterations
198.780858248
Time elapsed: 3.13
U norm 14.4716684728
V norm 1.73205080757
here
[...]
Iteration 150 (elapsed time: 1s, 0.0mn)
--- Convergence reached after 166 iterations
Learning code... done (total time: 1s, 0.0mn)
198.780765235
Time elapsed: 1.99
U norm 14.4722062044
V norm 1.73205080757
Wait, the dictionary is the same, but the code is a bit different now...
master
Iteration 1 (elapsed time: 0s, 0.0mn, current cost 279.778010084)
Iteration 2 (elapsed time: 0s, 0.0mn, current cost 199.544246461)
Iteration 3 (elapsed time: 0s, 0.0mn, current cost 199.522729734)
[...]
Iteration 165 (elapsed time: 2s, 0.0mn, current cost 198.781266815)
Iteration 166 (elapsed time: 2s, 0.0mn, current cost 198.781055910)
--- Convergence reached after 166 iterations
198.780858248
Time elapsed: 2.13
U norm 14.4716684728
V norm 1.73205080757
head
Iteration 0 (elapsed time: 0s, 0.0mn, current cost 0.000000000
Iteration 1 (elapsed time: 0s, 0.0mn, current cost 279.778014383
Iteration 2 (elapsed time: 0s, 0.0mn, current cost 199.544247482
Iteration 3 (elapsed time: 0s, 0.0mn, current cost 199.522729073
[...]
Iteration 165 (elapsed time: 2s, 0.0mn, current cost 198.781266718
Iteration 166 (elapsed time: 2s, 0.0mn, current cost 198.781055820
--- Convergence reached after 166 iterations
Learning code... done (total time: 2s, 0.0mn)
198.780765048
Time elapsed: 2.16
U norm 14.4722067347
V norm 1.73205080757
I think the paths are indeed the same up to numerical precision. Other than that, with the new version, the code is updated once more at the end. (Previously, in batch mode, the dictionary was the last block updated.)
Hm, I'd expect things to be numerically a bit different due to at least one of the following:
- the modified projection formula
- the indentation issue on the second call to ger(...)
I'd have proposed to open a separate issue on these things and then address them in a separate PR, but then this would be a really minor change (in terms of patch size). So I'd rather address them as part of this PR, unless there are compelling reasons to do otherwise. What do you think?
.5 * ||Xt - DC||_F^2 = .5 * tr(DtDA) - 2 * tr(DtB) + tr(XXt),
as a function of the dictionary (V), where
the dictionary D?
We should get the variable names consistent throughout. I would not be opposed to using C/D for code/dictionary in the whole file, and B/A with the meanings from the paper.
theta = float(batch_size ** 2 + ii + 1 - batch_size)
beta = (theta + 1 - batch_size) / (theta + 1)
beta = 0.
A *= beta
A += np.dot(this_code, this_code.T)
B *= beta
B += np.dot(this_X.T, this_code.T)
I think it's better to move the moving-average update inside the if online case and do
else:
    A = np.dot(...)
    B = np.dot(...)
just because it's less subtle.
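Roughly what's being suggested, as a self-contained sketch (the helper name is made up; the variable names follow the surrounding diff):
import numpy as np

def update_stats(A, B, this_X, this_code, online, beta):
    # Keep the exponentially-weighted update strictly in the online branch,
    # and recompute the statistics from scratch in batch mode.
    if online:
        A *= beta
        A += np.dot(this_code, this_code.T)
        B *= beta
        B += np.dot(this_X.T, this_code.T)
    else:
        A = np.dot(this_code, this_code.T)
        B = np.dot(this_X.T, this_code.T)
    return A, B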
else:
    return dictionary.T
return dictionary.T, errors
unfortunately this is public API that we cannot break, we need to return the exact same signature as before. We can add a default argument (return_errors=False), but this logic is going to get messy
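For instance, something along these lines could keep the old return values intact and only append errors on request (a sketch; the helper and the return_errors keyword are hypothetical, not existing API):
def pack_results(code, dictionary, errors, return_code=True, return_errors=False):
    # Historical behaviour first: (code, dictionary.T) or just dictionary.T;
    # `errors` is only appended when the caller opts in explicitly.
    out = (code, dictionary.T) if return_code else (dictionary.T,)
    if return_errors:
        out = out + (errors,)
    return out if len(out) > 1 else out[0]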
if n_jobs == -1:
    n_jobs = cpu_count()
def _compute_residuals_from_code(X, V, U):
this helper is currently only called in one place, we could inline it
method='lars', iter_offset=0, random_state=None,
return_inner_stats=False, inner_stats=None,
return_n_iter=False):
batch_size=None, verbose=False, shuffle=True,
batch_size should stay 3 by default, otherwise this breaks the public API.
And even beyond the fact that the API breaks, current behavior is counterintuitive: the function name is "dict_learning_online", but it does batch learning with the default arguments!
Agreed. Fixed.
return_inner_stats=False, inner_stats=None,
return_n_iter=False):
batch_size=None, verbose=False, shuffle=True,
n_jobs=1, method='lars', iter_offset=0, tol=0.,
the new tol argument is unused in the online case, right? This is confusing from an API standpoint.
Maybe a better idea is to have a private function _dict_learning that does what you're currently doing here, and wrap it with dict_learning and dict_learning_online, which implement the appropriate API.
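Schematically, something like this (the private helper and its exact parameters are hypothetical):
def _dict_learning(X, n_components, alpha, batch_size=None, tol=0., **kwargs):
    # shared implementation: batch mode when batch_size is None (tol applies),
    # online / mini-batch mode otherwise (tol is ignored)
    ...

def dict_learning(X, n_components, alpha, tol=1e-8, **kwargs):
    # public batch API, signature unchanged
    return _dict_learning(X, n_components, alpha, batch_size=None, tol=tol, **kwargs)

def dict_learning_online(X, n_components=2, alpha=1, batch_size=3, **kwargs):
    # public online API, signature unchanged and without a tol parameter
    return _dict_learning(X, n_components, alpha, batch_size=batch_size, **kwargs)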
Wouldn't this lead to a messier code base due to the proliferation of *dict_learning* functions? There seems to be a compromise to be made here.
A less invasive solution would be to enrich the documentation of the tol param, explaining that it only takes effect in batch mode.
functions starting with "_" are not documented and not visible to the user normally, so it will just be a way for us to organize the implementation.
I think that's preferable to having parameters that only work some of the time.
OK, fixed. "dict_learning_online" no longer has "tol" param. Now, the APIs should be as before.
@@ -13,6 +13,7 @@
from sklearn.utils.testing import assert_raises
from sklearn.utils.testing import ignore_warnings
from sklearn.utils.testing import TempMemmap
from sklearn.utils.testing import assert_almost_equal
unused?
In batch mode the new implementation is nearly identical to the old, up to (I suspect) numerical precision and an extra fit of the code at the end. In online mode there are differences. This is on the digits dataset:
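A minimal sketch of this kind of comparison, to be run once on master and once on this branch (the parameters here are arbitrary placeholders, not the ones actually used):
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import dict_learning_online

X = load_digits().data
code, dictionary = dict_learning_online(X, n_components=15, alpha=1,
                                        n_iter=200, batch_size=10,
                                        random_state=0)
# compare the learned atoms and this objective across the two branches
obj = 0.5 * np.sum((X - np.dot(code, dictionary)) ** 2) + np.sum(np.abs(code))
print(obj)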
@dohmatob, I remember you mentioned yesterday that the old online version might have been incorrect; can you double-check the math? If you are right and can write a test that fails on master but passes here, it will motivate the differences in results and justify merging.
OK, great. I'll look at it this afternoon.
I've double-checked the math in the dict update on master, and indeed:
1. The batch mode is correct.
Why there is an error on the master branch:
where
Now, if we rewrite this in terms of the input data X and the codes C, we get:
Why doesn't the bug manifest?
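Schematically, the rewriting referred to above, following the analysis already worked out earlier in this thread (notation local to this block):
A = CC^\top, \quad B = X^\top C^\top, \quad R = B - DA = (X^\top - DC)\,C^\top
D_{:,k} \leftarrow \mathrm{proj}_{\ell_2}\big(D_{:,k} + R_{:,k}/A_{kk}\big) \quad \text{(per-atom update of eqn 10 in the paper)}
D_{:,k} \leftarrow R\,A_{k,:}^\top = (X^\top - DC)\,C^\top (CC^\top)_{:,k} \quad \text{(extra multiplication in the old online path)}
i.e. the old online update carries an additional factor of CC^\top compared to the paper's update.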
BTW, I've fixed the "API unpacking" problem you complained about.
random_state=None):
"""Update the dense dictionary factor in place.
def _proj_l2(v, copy=False):
"""Projects v unto l2 unit ball in-place.
--> in-place (if copy=False)
;)
hm, that's what happens when docstring and func prototype follow different evolutions and become out-of-sync. Good catch. Thx :)
def _update_dict(dictionary, Y, code, verbose=False, return_r2=False,
random_state=None):
"""Update the dense dictionary factor in place.
def _proj_l2(v, copy=False):
can you rename it to _project_l2
B = XtCt, A = CCt, R = B - DA.
Note that the update of the jth atom is given by (see eqn 10 of
you mean kth atom?
ya, fixed. Thanks.
R = R + D[:, k]A[k, :] # rank-1 update
D[:, k] = R[:, k] / A[k, k] # = D[:, k] + R[k] / A[k, k]
D[:, k] = proj(D[:, k]) # eqn 10
R = R - D[:, k]A[k, :] # rank-1 update
Can you add a note on what the different alphabets represent?
A, B are standard notation from the reference paper (and this notation is already used on master).
Please see eqn 11 of ref paper http://www.di.ens.fr/sierra/pdfs/icml09.pdf.
raghav makes a good point, we need to document A and B. No, A are not atoms, D[:, k] are the atoms.
Actually scratch that. This is already documented at line 332, @raghavrv, is it easy to miss?
we could rename dictionary to D in the entire file
Hm, how about instead making clear (by means of docstrings, comments, etc.) that D means dictionary (which is already the case). If we set out to rename D, we'd have to do something similar for code, this_code, etc. I think the labels D, C, etc. are only interesting when stating the formula being implemented (as done in the docstrings). But I may be wrong.
This function has two modes:
1. Batch mode, activated when batch_size is None or
batch_size >= n_sampels. This is the default.
n_samples
@@ -593,14 +582,18 @@ def dict_learning_online(X, n_components=2, alpha=1, n_iter=100,
n_iter : int,
Number of iterations to perform.
tol : float,
Tolerance for the stopping condition. Used in batch mode.
return_code : boolean,
Whether to also return the code U or just the dictionary V.
dict_init : array of shape (n_components, n_features),
Initial value for the dictionary for warm restart scenarios.
callback :
callback : callable, optional default None
The docstring issue you're pointing to already exists in master. Fixed anyway.
We could pose the bugfix as an optimization, because we avoid the spurious multiplication with a rotation matrix that was present before. Maybe this way @GaelVaroquaux will like it :)
@@ -571,12 +577,19 @@ def dict_learning_online(X, n_components=2, alpha=1, n_iter=100,
(U^*, V^*) = argmin 0.5 || X - U V ||_2^2 + alpha * || U ||_1
(U,V)
with || V_k ||_2 = 1 for all 0 <= k < n_components
with || V_k ||_2 <= 1 for all 0 <= k < n_components
where V is the dictionary and U is the sparse code. This is
The U, V here are also inconsistent with D/C as used above, and with D/alpha as used in the paper... but consistent with other decomposition algos in sklearn. Maybe we should use U/V in the other docstrings and maybe in the code too?
True.
@@ -600,7 +613,8 @@ def dict_learning_online(X, n_components=2, alpha=1, n_iter=100,
Initial value for the dictionary for warm restart scenarios.
callback :
Callable that gets invoked every five iterations.
Callable that gets invoked every five iterations. If it returns
it should be customizable how often it gets invoked.
This was the old logic. But yeah, I can add that to the PR.
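One possible shape for it (a sketch; callback_every is a hypothetical new parameter, default 5 to match the current every-five-iterations behaviour):
def maybe_invoke_callback(callback, iteration, callback_every=5, env=None):
    # Invoke the user callback every `callback_every` iterations; a falsish
    # return value tells the caller to abort the main loop.
    if callback is not None and iteration % callback_every == 0:
        return bool(callback(env))
    return True

# inside the main loop, roughly:
#     if not maybe_invoke_callback(callback, ii, callback_every, locals()):
#         break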
@@ -600,7 +613,8 @@ def dict_learning_online(X, n_components=2, alpha=1, n_iter=100,
Initial value for the dictionary for warm restart scenarios.
callback :
Callable that gets invoked every five iterations.
Callable that gets invoked every five iterations. If it returns
non-True, then the main loop (iteration on data) is aborted.
I guess the term is "falsish"?
haha, ya sure
@vene: the spurious multiplication bug that we found came as a by-product of looking at the code. My main objective with this PR was (and is) to collapse dict_learning and dict_learning_online into a single function (as they really should be!) :) Along the way, this fixes the spurious multiplication bug too.
R = -np.dot(dictionary, code)
R += Y
R = -np.dot(dictionary, A)
R += B
is this actually faster than R = B - np.dot(dictionary, A)? The latter is more readable imo
R = np.asfortranarray(R)
ger, = linalg.get_blas_funcs(('ger',), (dictionary, code))
ger, = linalg.get_blas_funcs(('ger',), (dictionary, A))
for k in range(n_components):
# R <- 1.0 * U_k * V_k^T + R
V_k -> A_k, and U_k should be dictionary (or the other way around, as long as it's consistent in the entire file).
Some refactoring has been done in #18975.
Addresses issue #9009.
Proposal
Have a single function dict_learning_online(batch_size=some_default) which reproduces full-batch mode (the current dict_learning function) when batch_size == n_samples.
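Under this proposal, usage would look something like the following sketch (defaults and shapes here are placeholders):
import numpy as np
from sklearn.decomposition import dict_learning_online

X = np.random.RandomState(0).randn(100, 30)

# full-batch mode: under the proposal, batch_size == n_samples would
# reproduce what dict_learning currently does
code_batch, D_batch = dict_learning_online(X, n_components=10, alpha=1.,
                                           batch_size=X.shape[0],
                                           random_state=0)

# genuine online / mini-batch mode
code_online, D_online = dict_learning_online(X, n_components=10, alpha=1.,
                                             batch_size=3, random_state=0)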