PERF Optimize dot product order · Issue #17684 · scikit-learn/scikit-learn · GitHub
PERF Optimize dot product order #17684


Open

rth opened this issue Jun 23, 2020 · 11 comments

Comments

@rth
Member
rth commented Jun 23, 2020

When multiplying 3 or more matrices, the order of parenthesization doesn't affect the result, but it can have a very significant impact on the number of operations and therefore on performance; see https://en.wikipedia.org/wiki/Matrix_chain_multiplication
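As an illustration (shapes invented for effect, not taken from the codebase): both parenthesizations give the same result, but one needs roughly half the scalar multiplications.

```python
import numpy as np

# Illustrative shapes only: the result is identical either way,
# but the operation counts differ by about 2x.
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 100))
B = rng.standard_normal((100, 1000))
C = rng.standard_normal((1000, 5))

left = (A @ B) @ C    # 10*100*1000 + 10*1000*5 = 1_050_000 scalar mults
right = A @ (B @ C)   # 100*1000*5  + 10*100*5  =   505_000 scalar mults

assert np.allclose(left, right)  # same result, different cost
```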

For matrix multiplication of dense arrays there is numpy.linalg.multi_dot, and I think we should use it. To find existing occurrences where it could be used, see for instance the result of

git grep 'dot(.*dot'
sklearn/datasets/_samples_generator.py:    return np.dot(np.dot(u, s), v.T)
sklearn/datasets/_samples_generator.py:    X = np.dot(np.dot(U, 1.0 + np.diag(generator.rand(n_dim))), Vt)
sklearn/decomposition/_fastica.py:    w -= np.dot(np.dot(w, W[:j].T), W[:j])
sklearn/decomposition/_fastica.py:    return np.dot(np.dot(u * (1. / np.sqrt(s)), u.T), W)
sklearn/decomposition/_fastica.py:                S = np.dot(np.dot(W, K), X).T
sklearn/decomposition/_nmf.py:            norm_WH = trace_dot(np.dot(np.dot(W.T, W), H), H)
sklearn/decomposition/_nmf.py:        denominator = np.dot(np.dot(W.T, W), H)
sklearn/discriminant_analysis.py:        self.coef_ = np.dot(self.means_, evecs).dot(evecs.T)
sklearn/gaussian_process/_gpc.py:            s_1 = .5 * a.T.dot(C).dot(a) - .5 * R.T.ravel().dot(C.ravel())
sklearn/gaussian_process/_gpc.py:            s_3 = b - K.dot(R.dot(b))  # Line 14
sklearn/linear_model/_bayes.py:            coef_ = np.dot(X.T, np.dot(
sklearn/linear_model/_logistic.py:        ret[:n_features] = X.T.dot(dX.dot(s[:n_features]))
sklearn/linear_model/_ridge.py:        AXy = A.dot(X_op.T.dot(y))

Ideally each replacement should be benchmarked.
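As a sketch of what such a replacement looks like, the first grep hit above could be rewritten with multi_dot (the shapes below are invented for the check, not the ones used in _samples_generator):

```python
import numpy as np

# Before: fixed left-to-right evaluation, as currently written.
# After: multi_dot picks the parenthesization from the operand shapes.
rng = np.random.default_rng(0)
u = rng.standard_normal((500, 10))
s = rng.standard_normal((10, 10))
v = rng.standard_normal((500, 10))

nested = np.dot(np.dot(u, s), v.T)
chained = np.linalg.multi_dot([u, s, v.T])

assert np.allclose(nested, chained)
```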

For matrix multiplications via safe_sparse_dot involving a combination of sparse and dense matrices, some of this could apply as well, though defining a general heuristic is probably a bit more difficult there.

@postmalloc
Contributor
postmalloc commented Jun 25, 2020

Turns out multi_dot can be slower[1] than dot depending on the sizes of the arrays and how much they vary. However, multi_dot uses much simpler logic to identify the right order when the dot product is over exactly 3 matrices[2]. Considering that most of the nested dot products in the code seem to involve 3 matrices, multi_dot may well provide performance gains.
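A rough sketch of that three-matrix shortcut (best_order_three is a hypothetical helper modeled on numpy's internal _multi_dot_three, not part of any library): it just compares the scalar-multiplication counts of the two possible parenthesizations and evaluates the cheaper one.

```python
import numpy as np

def best_order_three(A, B, C):
    # Hypothetical helper mirroring the idea behind numpy's internal
    # _multi_dot_three: compare the scalar-multiplication counts of the
    # two parenthesizations and evaluate the cheaper one.
    p, q = A.shape   # A: p x q
    _, r = B.shape   # B: q x r
    _, s = C.shape   # C: r x s
    cost_ab_c = p * q * r + p * r * s   # (A @ B) @ C
    cost_a_bc = q * r * s + p * q * s   # A @ (B @ C)
    if cost_ab_c < cost_a_bc:
        return (A @ B) @ C
    return A @ (B @ C)

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 100))
B = rng.standard_normal((100, 1000))
C = rng.standard_normal((1000, 5))
assert np.allclose(best_order_three(A, B, C), np.linalg.multi_dot([A, B, C]))
```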

[1] https://stackoverflow.com/questions/45852228/how-is-numpy-multi-dot-slower-than-numpy-dot
[2] https://github.com/numpy/numpy/blob/94721320b1e13fd60046dc8bd0d343c54c2dd2e9/numpy/linalg/linalg.py#L2664

@rth
Member Author
rth commented Jun 25, 2020

Yes, indeed, that's why benchmarking is useful in any case.

@postmalloc
Contributor
postmalloc commented Jun 25, 2020

I replaced the nested dot products in a few modules with multi_dot without further inspection, and did a basic speed comparison on a single-core machine. The dataset is fairly large.

The code in my Jupyter notebook looks something like this:

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import FastICA

X, y = load_digits(return_X_y=True)
X.shape
>>> (1797, 64)

transformer = FastICA(n_components=7, random_state=0)

%%timeit -r 20
X_transformed = transformer.fit_transform(X)

from sklearn import linear_model
clf = linear_model.BayesianRidge()

%%timeit -r 20
clf.fit(X, y)

# Try with a bigger dataset
X = np.tile(X, (8, 8))
y = np.tile(y, 8)
X.shape, y.shape
>>> ((14376, 512), (14376,))
| Module | dot | multi_dot |
| --- | --- | --- |
| FastICA | 229 ms ± 6.57 ms per loop (mean ± std. dev. of 20 runs, 1 loop each) | 240 ms ± 10.7 ms per loop (mean ± std. dev. of 20 runs, 1 loop each) |
| FastICA (large) | 6.59 s ± 307 ms per loop (mean ± std. dev. of 20 runs, 1 loop each) | 6.62 s ± 181 ms per loop (mean ± std. dev. of 20 runs, 1 loop each) |
| NMF | 212 ms ± 5.64 ms per loop (mean ± std. dev. of 20 runs, 1 loop each) | 220 ms ± 12 ms per loop (mean ± std. dev. of 20 runs, 1 loop each) |
| NMF (large) | 10.6 s ± 112 ms per loop (mean ± std. dev. of 20 runs, 1 loop each) | 10.8 s ± 173 ms per loop (mean ± std. dev. of 20 runs, 1 loop each) |
| BayesianRidge | 13.4 ms ± 350 µs per loop (mean ± std. dev. of 20 runs, 100 loops each) | 13.7 ms ± 489 µs per loop (mean ± std. dev. of 20 runs, 100 loops each) |
| BayesianRidge (large) | 2.6 s ± 65 ms per loop (mean ± std. dev. of 20 runs, 1 loop each) | 2.38 s ± 114 ms per loop (mean ± std. dev. of 20 runs, 1 loop each) |

@rth
Member Author
rth commented Jun 25, 2020

Thanks for checking @d3b0unce! So it might be worth it for BayesianRidge. Please feel free to open a PR. It could also be worth benchmarking ridge on a dataset with something like 20k samples and 30 features, which would be more typical.

For the first occurrence in _samples_generator I did see a small performance improvement when I benchmarked it earlier.

@postmalloc
Contributor

Hi @rth, I tested again with a (20000, 30) dataset, on a multi-core machine.

The dataset was created like this:

import numpy as np

np.random.seed(0)
X = np.random.rand(20000, 30)
y = np.random.rand(20000)

multi_dot seems to be faster than or on par with nested dot products for these classes. We can use multi_dot for all of these, I think.

Should I commit the changes for the classes below as a part of PR #17737?

| Class | dot | multi_dot |
| --- | --- | --- |
| FastICA | 313 ms ± 34.5 ms per loop (mean ± std. dev. of 20 runs, 1 loop each) | 318 ms ± 73 ms per loop (mean ± std. dev. of 20 runs, 1 loop each) |
| NMF | 3.33 s ± 170 ms per loop (mean ± std. dev. of 20 runs, 1 loop each) | 3.19 s ± 133 ms per loop (mean ± std. dev. of 20 runs, 1 loop each) |
| BayesianRidge | 71.1 ms ± 11.7 ms per loop (mean ± std. dev. of 100 runs, 10 loops each) | 63.3 ms ± 13.3 ms per loop (mean ± std. dev. of 100 runs, 1 loop each) |
| ARDRegression | 46.7 ms ± 9.7 ms per loop (mean ± std. dev. of 100 runs, 10 loops each) | 39.7 ms ± 7.28 ms per loop (mean ± std. dev. of 100 runs, 10 loops each) |

@rth
Member Author
rth commented Jun 26, 2020

Great!

Should I commit the changes for the classes below as a part of PR

Yes, please; it would be easier for reviewers to evaluate if it's all in one branch.

@jeremiedbb
Member

We should benchmark when using sparse data as well. btw do you understand this?

import numpy as np 
from scipy.sparse import random 
X = random(m=10, n=100) 
Y = random(m=100, n=100) 
Z = random(m=100, n=10) 
np.linalg.multi_dot([X, Y, Z])     
---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
<ipython-input-2-dc6b8c3b5207> in <module>
      4 Y = random(m=100, n=100)
      5 Z = random(m=100, n=10)
----> 6 np.linalg.multi_dot([X, Y, Z])

<__array_function__ internals> in multi_dot(*args, **kwargs)

~/miniconda/envs/dev2/lib/python3.8/site-packages/numpy/linalg/linalg.py in multi_dot(arrays)
   2658     if arrays[-1].ndim == 1:
   2659         arrays[-1] = atleast_2d(arrays[-1]).T
-> 2660     _assert_2d(*arrays)
   2661 
   2662     # _multi_dot_three is much faster than _multi_dot_matrix_chain_order

~/miniconda/envs/dev2/lib/python3.8/site-packages/numpy/linalg/linalg.py in _assert_2d(*arrays)
    198     for a in arrays:
    199         if a.ndim != 2:
--> 200             raise LinAlgError('%d-dimensional array given. Array must be '
    201                     'two-dimensional' % a.ndim)
    202 

LinAlgError: 0-dimensional array given. Array must be two-dimensional

@rth
Member Author
rth commented Jun 26, 2020

We should benchmark when using sparse data as well.

multi_dot only applies to dense data I think. All of the places where we use multi_dot should use dense data exclusively, otherwise its heuristic based on shapes doesn't make sense.

btw do you understand this?

Yeah, I guess it doesn't work on sparse matrices, and it doesn't raise a clean exception either.
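A minimal sketch of a guard for call sites where operand types aren't guaranteed (chained_dot is a hypothetical name, not an existing scikit-learn helper):

```python
import numpy as np
from scipy import sparse

def chained_dot(A, B, C):
    # Hypothetical guard: np.linalg.multi_dot chokes on scipy sparse
    # matrices (they get wrapped as 0-d object arrays, hence the
    # confusing LinAlgError above), so use it only when every operand
    # is dense, and otherwise fall back to left-to-right .dot products.
    if any(sparse.issparse(M) for M in (A, B, C)):
        return A.dot(B).dot(C)
    return np.linalg.multi_dot([A, B, C])

# Sparse operands now take the safe path instead of raising.
X = sparse.random(10, 100, density=0.1, random_state=0)
Y = sparse.random(100, 100, density=0.1, random_state=0)
Z = sparse.random(100, 10, density=0.1, random_state=0)
assert chained_dot(X, Y, Z).shape == (10, 10)
```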

@postmalloc
Contributor

All of the places where we use multi_dot should use dense data exclusively

Yes, that makes sense. I got these errors when I changed them in the wrong places, such as in _ridge.py.

Yes, please it would be easier for reviewers to evaluate if it's in one branch..

I pushed the changes for FastICA, NMF, BayesianRidge, and ARDRegression in #17737. Do you want the changes for other modules that have not been benchmarked yet to go in a separate PR?

@rth
Member Author
rth commented Jun 26, 2020

Do you want the changes for other modules that have not been benchmarked yet to go in a separate PR?

Sounds good. Let's merge that one first, I'll try to review in a day or so.

@TomDLT
Member
TomDLT commented Jul 13, 2020

jjerphan added a commit to jjerphan/scikit-learn that referenced this issue Feb 26, 2021
np.linalg.multi_dot has been shown to be faster for the dot product of 3 matrices.

See:
scikit-learn#17684 (comment)