🚀 The feature, motivation and pitch
Hi team, while using torch.addmm(input, mat1, mat2) (an out-of-place operation) where all three inputs are 2D tensors (i.e. matrices), I noticed an explicit device-to-device (D2D) memcpy: the call falls back to the non-cublasLt routine, which only supports in-place operation.
See the nsys timeline: the D2D copy takes nearly half as long as the actual GEMM.
For a conventional linear layer with bias, there is no such D2D copy. A minimal repro is sketched below.
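A minimal sketch of the observation, assuming a CUDA device is available; the shapes and dtype here are illustrative, not the ones from my actual workload:

```python
import torch

m, n, k = 4096, 4096, 4096
mat1 = torch.randn(m, k, device="cuda", dtype=torch.half)
mat2 = torch.randn(k, n, device="cuda", dtype=torch.half)
input_2d = torch.randn(m, n, device="cuda", dtype=torch.half)  # 2D input -> falls back, extra D2D copy
bias_1d = torch.randn(n, device="cuda", dtype=torch.half)      # 1D bias  -> cublasLt epilogue, no copy

with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CUDA]) as prof:
    torch.addmm(input_2d, mat1, mat2)                       # out-of-place addmm with 2D input
    torch.nn.functional.linear(mat1, mat2.t(), bias_1d)     # conventional linear with bias
    torch.cuda.synchronize()

print(prof.key_averages().table(sort_by="cuda_time_total"))
```

In the resulting profiler table, the addmm call should show a "Memcpy DtoD" event alongside the GEMM kernel, while the linear call with a 1D bias should not.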
The restrictions for using cublasLt are listed here: cublasLt is only activated when beta == 1 and input is a 1D tensor. However, according to the latest cublasLt documentation:
This function supports both in-place matrix multiplication (C == D and Cdesc == Ddesc) and out-of-place matrix multiplication (C != D, both matrices must have the same data type, number of rows, number of columns, batch size, and memory order). In the out-of-place case, the leading dimension of C can be different from the leading dimension of D. Specifically the leading dimension of C can be 0 to achieve row or column broadcast. If Cdesc is omitted, this function assumes it to be equal to Ddesc.
So there are no restrictions on beta or on the input dims, and I wonder if you could relax the condition that enables cublasLt (a quick check of the required semantics is sketched after this paragraph). Thanks!
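For reference, this is the general case that an out-of-place cublasLtMatmul (D = alpha * A * B + beta * C, with C != D) would need to cover; a quick numerical check, with illustrative shapes and scalars, that addmm's general form maps onto it directly:

```python
import torch

alpha, beta = 2.0, 0.5
mat1 = torch.randn(64, 32, device="cuda")
mat2 = torch.randn(32, 16, device="cuda")
inp = torch.randn(64, 16, device="cuda")   # 2D input, arbitrary beta

# addmm computes alpha * (mat1 @ mat2) + beta * inp, i.e. exactly
# D = alpha * A * B + beta * C with C (inp) distinct from D (out).
out = torch.addmm(inp, mat1, mat2, beta=beta, alpha=alpha)
ref = alpha * (mat1 @ mat2) + beta * inp
print(torch.allclose(out, ref, atol=1e-5))  # True
```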
Alternatives
No response
Additional context
No response
cc @ptrblck @msaroufim @eqy @jerryzh168 @csarofeen @xwang233 @jianyuh @nikitaved @mruberry @walterddr @lezcano