🚀 The feature, motivation and pitch
Hi team, while using torch.addmm(input, mat1, mat2) (an out-of-place operation) where all three inputs are 2D tensors (i.e. matrices), I noticed an explicit device-to-device (D2D) memcpy: the call falls back to the non-cublasLt routine, which only supports in-place operation.
See the nsys timeline: the D2D copy takes nearly half as long as the actual GEMM.
For a conventional linear layer with bias, there is no such D2D copy. A minimal repro is sketched below.
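A minimal sketch of the observation, assuming a CUDA device is available; the shapes and dtype here are illustrative, not the ones from my actual workload:

```python
import torch

m, n, k = 4096, 4096, 4096
mat1 = torch.randn(m, k, device="cuda", dtype=torch.half)
mat2 = torch.randn(k, n, device="cuda", dtype=torch.half)
input_2d = torch.randn(m, n, device="cuda", dtype=torch.half)  # 2D input -> falls back, extra D2D copy
bias_1d = torch.randn(n, device="cuda", dtype=torch.half)      # 1D bias  -> cublasLt epilogue, no copy

with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CUDA]) as prof:
    torch.addmm(input_2d, mat1, mat2)                       # out-of-place addmm with 2D input
    torch.nn.functional.linear(mat1, mat2.t(), bias_1d)     # conventional linear with bias
    torch.cuda.synchronize()

print(prof.key_averages().table(sort_by="cuda_time_total"))
```

In the resulting profiler table, the addmm call should show a "Memcpy DtoD" event alongside the GEMM kernel, while the linear call with a 1D bias should not.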
The restrictions for using cublasLt are listed here: cublasLt is only activated when beta == 1 and input is a 1D tensor. However, according to the latest cublasLt documentation:
This function supports both in-place matrix multiplication (C == D and Cdesc == Ddesc) and out-of-place matrix multiplication (C != D, both matrices must have the same data type, number of rows, number of columns, batch size, and memory order). In the out-of-place case, the leading dimension of C can be different from the leading dimension of D. Specifically the leading dimension of C can be 0 to achieve row or column broadcast. If Cdesc is omitted, this function assumes it to be equal to Ddesc.
So there are no restrictions on beta or on the input dims, and I wonder if you could relax the condition that enables cublasLt (a quick check of the required semantics is sketched after this paragraph). Thanks!
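For reference, this is the general case that an out-of-place cublasLtMatmul (D = alpha * A * B + beta * C, with C != D) would need to cover; a quick numerical check, with illustrative shapes and scalars, that addmm's general form maps onto it directly:

```python
import torch

alpha, beta = 2.0, 0.5
mat1 = torch.randn(64, 32, device="cuda")
mat2 = torch.randn(32, 16, device="cuda")
inp = torch.randn(64, 16, device="cuda")   # 2D input, arbitrary beta

# addmm computes alpha * (mat1 @ mat2) + beta * inp, i.e. exactly
# D = alpha * A * B + beta * C with C (inp) distinct from D (out).
out = torch.addmm(inp, mat1, mat2, beta=beta, alpha=alpha)
ref = alpha * (mat1 @ mat2) + beta * inp
print(torch.allclose(out, ref, atol=1e-5))  # True
```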
Alternatives
No response
Additional context
No response
cc @ptrblck @msaroufim @eqy @jerryzh168 @csarofeen @xwang233 @jianyuh @nikitaved @mruberry @walterddr @lezcano