PERF Optimize dot product order · Issue #17684 · scikit-learn/scikit-learn · GitHub
PERF Optimize dot product order #17684


Open

rth opened this issue Jun 23, 2020 · 11 comments

Comments

@rth
Member
rth commented Jun 23, 2020

When multiplying 3 or more matrices, the order of parenthesization doesn't affect the result, but it can have a very significant impact on the number of operations and therefore on performance; see https://en.wikipedia.org/wiki/Matrix_chain_multiplication
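As an illustration (shapes invented for effect, not taken from the codebase): both parenthesizations give the same result, but one needs roughly half the scalar multiplications.

```python
import numpy as np

# Illustrative shapes only: the result is identical either way,
# but the operation counts differ by about 2x.
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 100))
B = rng.standard_normal((100, 1000))
C = rng.standard_normal((1000, 5))

left = (A @ B) @ C    # 10*100*1000 + 10*1000*5 = 1_050_000 scalar mults
right = A @ (B @ C)   # 100*1000*5  + 10*100*5  =   505_000 scalar mults

assert np.allclose(left, right)  # same result, different cost
```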

For matrix multiplication of dense arrays there is numpy.linalg.multi_dot, and I think we should use it. To find existing occurrences where it could be used, see for instance the result of

git grep 'dot(.*dot'
sklearn/datasets/_samples_generator.py:    return np.dot(np.dot(u, s), v.T)
sklearn/datasets/_samples_generator.py:    X = np.dot(np.dot(U, 1.0 + np.diag(generator.rand(n_dim))), Vt)
sklearn/decomposition/_fastica.py:    w -= np.dot(np.dot(w, W[:j].T), W[:j])
sklearn/decomposition/_fastica.py:    return np.dot(np.dot(u * (1. / np.sqrt(s)), u.T), W)
sklearn/decomposition/_fastica.py:                S = np.dot(np.dot(W, K), X).T
sklearn/decomposition/_nmf.py:            norm_WH = trace_dot(np.dot(np.dot(W.T, W), H), H)
sklearn/decomposition/_nmf.py:        denominator = np.dot(np.dot(W.T, W), H)
sklearn/discriminant_analysis.py:        self.coef_ = np.dot(self.means_, evecs).dot(evecs.T)
sklearn/gaussian_process/_gpc.py:            s_1 = .5 * a.T.dot(C).dot(a) - .5 * R.T.ravel().dot(C.ravel())
sklearn/gaussian_process/_gpc.py:            s_3 = b - K.dot(R.dot(b))  # Line 14
sklearn/linear_model/_bayes.py:            coef_ = np.dot(X.T, np.dot(
sklearn/linear_model/_logistic.py:        ret[:n_features] = X.T.dot(dX.dot(s[:n_features]))
sklearn/linear_model/_ridge.py:        AXy = A.dot(X_op.T.dot(y))

Ideally each replacement should be benchmarked.
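As a sketch of what such a replacement looks like, the first grep hit above could be rewritten with multi_dot (the shapes below are invented for the check, not the ones used in _samples_generator):

```python
import numpy as np

# Before: fixed left-to-right evaluation, as currently written.
# After: multi_dot picks the parenthesization from the operand shapes.
rng = np.random.default_rng(0)
u = rng.standard_normal((500, 10))
s = rng.standard_normal((10, 10))
v = rng.standard_normal((500, 10))

nested = np.dot(np.dot(u, s), v.T)
chained = np.linalg.multi_dot([u, s, v.T])

assert np.allclose(nested, chained)
```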

For matrix multiplications via safe_sparse_dot involving a combination of sparse and dense matrices, some of this could apply as well, though defining a general heuristic is probably a bit more difficult there.

@postmalloc
Contributor
postmalloc commented Jun 25, 2020

Turns out multi_dot can be slower[1] than dot depending on the sizes of the arrays and how much they vary. However, multi_dot uses much simpler logic to identify the right order when the dot product is over exactly 3 matrices[2]. Considering that most of the nested dot products in the code seem to involve 3 matrices, multi_dot may well provide performance gains.
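A rough sketch of that three-matrix shortcut (best_order_three is a hypothetical helper modeled on numpy's internal _multi_dot_three, not part of any library): it just compares the scalar-multiplication counts of the two possible parenthesizations and evaluates the cheaper one.

```python
import numpy as np

def best_order_three(A, B, C):
    # Hypothetical helper mirroring the idea behind numpy's internal
    # _multi_dot_three: compare the scalar-multiplication counts of the
    # two parenthesizations and evaluate the cheaper one.
    p, q = A.shape   # A: p x q
    _, r = B.shape   # B: q x r
    _, s = C.shape   # C: r x s
    cost_ab_c = p * q * r + p * r * s   # (A @ B) @ C
    cost_a_bc = q * r * s + p * q * s   # A @ (B @ C)
    if cost_ab_c < cost_a_bc:
        return (A @ B) @ C
    return A @ (B @ C)

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 100))
B = rng.standard_normal((100, 1000))
C = rng.standard_normal((1000, 5))
assert np.allclose(best_order_three(A, B, C), np.linalg.multi_dot([A, B, C]))
```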

[1] https://stackoverflow.com/questions/45852228/how-is-numpy-multi-dot-slower-than-numpy-dot
[2] https://github.com/numpy/numpy/blob/94721320b1e13fd60046dc8bd0d343c54c2dd2e9/numpy/linalg/linalg.py#L2664

@rth
Member Author
rth commented Jun 25, 2020

Yes, indeed, that's why benchmarking is useful in any case.

@postmalloc
Contributor
postmalloc commented Jun 25, 2020

I replaced the nested dot products in a few modules with multi_dot without further inspection, and did a basic speed comparison on a single-core machine. The dataset is fairly large.

The code in my Jupyter notebook looks something like this:

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import FastICA

X, y = load_digits(return_X_y=True)
X.shape
>>> (1797, 64)

transformer = FastICA(n_components=7, random_state=0)

%%timeit -r 20
X_transformed = transformer.fit_transform(X)

from sklearn import linear_model
clf = linear_model.BayesianRidge()

%%timeit -r 20
clf.fit(X, y)

# Try with a bigger dataset
X = np.tile(X, (8, 8))
y = np.tile(y, 8)
X.shape, y.shape
>>> ((14376, 512), (14376,))
| Module | dot | multi_dot |
| --- | --- | --- |
| FastICA | 229 ms ± 6.57 ms per loop (mean ± std. dev. of 20 runs, 1 loop each) | 240 ms ± 10.7 ms per loop (mean ± std. dev. of 20 runs, 1 loop each) |
| FastICA (large) | 6.59 s ± 307 ms per loop (mean ± std. dev. of 20 runs, 1 loop each) | 6.62 s ± 181 ms per loop (mean ± std. dev. of 20 runs, 1 loop each) |
| NMF | 212 ms ± 5.64 ms per loop (mean ± std. dev. of 20 runs, 1 loop each) | 220 ms ± 12 ms per loop (mean ± std. dev. of 20 runs, 1 loop each) |
| NMF (large) | 10.6 s ± 112 ms per loop (mean ± std. dev. of 20 runs, 1 loop each) | 10.8 s ± 173 ms per loop (mean ± std. dev. of 20 runs, 1 loop each) |
| BayesianRidge | 13.4 ms ± 350 µs per loop (mean ± std. dev. of 20 runs, 100 loops each) | 13.7 ms ± 489 µs per loop (mean ± std. dev. of 20 runs, 100 loops each) |
| BayesianRidge (large) | 2.6 s ± 65 ms per loop (mean ± std. dev. of 20 runs, 1 loop each) | 2.38 s ± 114 ms per loop (mean ± std. dev. of 20 runs, 1 loop each) |

@rth
Member Author
rth commented Jun 25, 2020

Thanks for checking @d3b0unce! So it might be worth it for BayesianRidge. Please feel free to open a PR. It could also be worth benchmarking ridge on a dataset with something like 20k samples and 30 features, which would be more typical.

For the first occurrence in _samples_generator I did see a small performance improvement when I benchmarked it earlier.

@postmalloc
Contributor

Hi @rth, I tested again with a (20000, 30) dataset, on a multi-core machine.

The dataset was created like this:

import numpy as np

np.random.seed(0)
X = np.random.rand(20000, 30)
y = np.random.rand(20000)

multi_dot seems to be faster than or on par with nested dot products for these classes. We can use multi_dot for all of these, I think.

Should I commit the changes for the classes below as a part of PR #17737?

| Class | dot | multi_dot |
| --- | --- | --- |
| FastICA | 313 ms ± 34.5 ms per loop (mean ± std. dev. of 20 runs, 1 loop each) | 318 ms ± 73 ms per loop (mean ± std. dev. of 20 runs, 1 loop each) |
| NMF | 3.33 s ± 170 ms per loop (mean ± std. dev. of 20 runs, 1 loop each) | 3.19 s ± 133 ms per loop (mean ± std. dev. of 20 runs, 1 loop each) |
| BayesianRidge | 71.1 ms ± 11.7 ms per loop (mean ± std. dev. of 100 runs, 10 loops each) | 63.3 ms ± 13.3 ms per loop (mean ± std. dev. of 100 runs, 1 loop each) |
| ARDRegression | 46.7 ms ± 9.7 ms per loop (mean ± std. dev. of 100 runs, 10 loops each) | 39.7 ms ± 7.28 ms per loop (mean ± std. dev. of 100 runs, 10 loops each) |

@rth
Member Author
rth commented Jun 26, 2020

Great!

Should I commit the changes for the classes below as a part of PR

Yes, please; it would be easier for reviewers to evaluate if it's all in one branch.

@jeremiedbb
Member

We should benchmark when using sparse data as well. btw do you understand this?

import numpy as np 
from scipy.sparse import random 
X = random(m=10, n=100) 
Y = random(m=100, n=100) 
Z = random(m=100, n=10) 
np.linalg.multi_dot([X, Y, Z])     
---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
<ipython-input-2-dc6b8c3b5207> in <module>
      4 Y = random(m=100, n=100)
      5 Z = random(m=100, n=10)
----> 6 np.linalg.multi_dot([X, Y, Z])

<__array_function__ internals> in multi_dot(*args, **kwargs)

~/miniconda/envs/dev2/lib/python3.8/site-packages/numpy/linalg/linalg.py in multi_dot(arrays)
   2658     if arrays[-1].ndim == 1:
   2659         arrays[-1] = atleast_2d(arrays[-1]).T
-> 2660     _assert_2d(*arrays)
   2661 
   2662     # _multi_dot_three is much faster than _multi_dot_matrix_chain_order

~/miniconda/envs/dev2/lib/python3.8/site-packages/numpy/linalg/linalg.py in _assert_2d(*arrays)
    198     for a in arrays:
    199         if a.ndim != 2:
--> 200             raise LinAlgError('%d-dimensional array given. Array must be '
    201                     'two-dimensional' % a.ndim)
    202 

LinAlgError: 0-dimensional array given. Array must be two-dimensional

@rth
Member Author
rth commented Jun 26, 2020

We should benchmark when using sparse data as well.

multi_dot only applies to dense data I think. All of the places where we use multi_dot should use dense data exclusively, otherwise its heuristic based on shapes doesn't make sense.

btw do you understand this?

Yeah, I guess it doesn't work on sparse matrices, and it doesn't raise a clean exception either.
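A minimal sketch of a guard for call sites where operand types aren't guaranteed (chained_dot is a hypothetical name, not an existing scikit-learn helper):

```python
import numpy as np
from scipy import sparse

def chained_dot(A, B, C):
    # Hypothetical guard: np.linalg.multi_dot chokes on scipy sparse
    # matrices (they get wrapped as 0-d object arrays, hence the
    # confusing LinAlgError above), so use it only when every operand
    # is dense, and otherwise fall back to left-to-right .dot products.
    if any(sparse.issparse(M) for M in (A, B, C)):
        return A.dot(B).dot(C)
    return np.linalg.multi_dot([A, B, C])

# Sparse operands now take the safe path instead of raising.
X = sparse.random(10, 100, density=0.1, random_state=0)
Y = sparse.random(100, 100, density=0.1, random_state=0)
Z = sparse.random(100, 10, density=0.1, random_state=0)
assert chained_dot(X, Y, Z).shape == (10, 10)
```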

@postmalloc
Contributor

All of the places where we use multi_dot should use dense data exclusively

Yes, that makes sense. I got these errors when I changed them in the wrong places, such as in _ridge.py.

Yes, please it would be easier for reviewers to evaluate if it's in one branch..

I pushed the changes for FastICA, NMF, BayesianRidge, and ARDRegression in #17737. Do you want the changes for other modules that have not been benchmarked yet to go in a separate PR?

@rth
Member Author
rth commented Jun 26, 2020

Do you want the changes for other modules that have not been benchmarked yet to go in a separate PR?

Sounds good. Let's merge that one first, I'll try to review in a day or so.

@TomDLT
Member
TomDLT commented Jul 13, 2020

jjerphan added a commit to jjerphan/scikit-learn that referenced this issue Feb 26, 2021
np.linalg.multi_dot has been shown to be faster for the dot product of 3 matrices.

See:
scikit-learn#17684 (comment)