Feature: Batch matmul by dsyme · Pull Request #88 · DiffSharp/DiffSharp · GitHub

Feature: Batch matmul #88

Merged: 29 commits merged into dev on Oct 26, 2020

Conversation

@dsyme (Collaborator) commented Feb 26, 2020

Adjust MatMul to support batching and broadcasting, in both Tensor and the reference implementation

Builds on #85

@dsyme dsyme changed the base branch from dev to feature/expand February 26, 2020 15:11
@dsyme dsyme mentioned this pull request Feb 26, 2020
@dsyme dsyme changed the title from "Batch matmul" to "Feature: Batch matmul" Feb 27, 2020
@gbaydin (Member) left a comment

Transpose is missing tests in TestTensor.fs

I would like to implement the derivative tests to convince myself of the correctness. For the time being, let's add a TODO comment in TestDerivative.fs to record the missing test.

Also it would be great to have some Python reference code.
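
For instance, reference code along these lines (a sketch in PyTorch; the shapes here are arbitrary) could pin down the expected batching and broadcasting behavior:

import torch

a = torch.randn(2, 3, 4, 5)
b = torch.randn(5, 6)

# torch.matmul broadcasts the 2d argument across the leading
# batch dimensions of the 4d argument
c = torch.matmul(a, b)
print(c.shape)  # torch.Size([2, 3, 4, 6])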

@dsyme (Collaborator, Author) commented Apr 22, 2020

@gbaydin Can we get this in, do you think?

@dsyme dsyme changed the base branch from feature/expand to dev April 29, 2020 13:36
@dsyme (Collaborator, Author) commented Apr 29, 2020

I've merged this with dev and it should now be ready (once tests and coverage pass).

@codecov (bot) commented Apr 29, 2020

Codecov Report

Merging #88 into dev will increase coverage by 5.36%.
The diff coverage is 76.47%.


@@            Coverage Diff             @@
##              dev      #88      +/-   ##
==========================================
+ Coverage   67.11%   72.47%   +5.36%     
==========================================
  Files          18       18              
  Lines        5269     5714     +445     
  Branches     1296     1325      +29     
==========================================
+ Hits         3536     4141     +605     
+ Misses       1017      837     -180     
- Partials      716      736      +20     
Impacted Files Coverage Δ
src/DiffSharp.Core/Extensions.fs 58.16% <0.00%> (+3.32%) ⬆️
src/DiffSharp.Core/RawTensor.fs 86.26% <ø> (-1.14%) ⬇️
src/DiffSharp.Core/Tensor.fs 78.20% <ø> (+8.63%) ⬆️
src/DiffSharp.Core/Shape.fs 56.57% <60.00%> (+4.78%) ⬆️
...iffSharp.Backends.Reference/Reference.RawTensor.fs 71.14% <100.00%> (+1.07%) ⬆️
src/DiffSharp.Backends.Torch/Torch.RawTensor.fs 85.81% <100.00%> (-0.75%) ⬇️
src/DiffSharp.Core/DiffSharp.Numerical.fs 77.63% <0.00%> (-3.95%) ⬇️
src/DiffSharp.Core/Data.fs 74.67% <0.00%> (+0.69%) ⬆️
... and 13 more

@dsyme (Collaborator, Author) commented May 4, 2020

@gbaydin I've merged this with dev; it should now be ready.

@gbaydin (Member) commented May 5, 2020

I think the behavior of the transpose operation in this branch is not consistent with PyTorch.

PyTorch transpose behavior is as follows:

a = torch.randn([])
b = torch.randn([10])
c = torch.randn([10,20])
d = torch.randn([10,20,30])
e = torch.randn([10,20,30,40])
at = torch.t(a) # Gives shape []
bt = torch.t(b) # Gives shape [10]
ct = torch.t(c) # Gives shape [20, 10]
dt = torch.t(d) # Fails because d.dim > 2
et = torch.t(e) # Fails because e.dim > 2

DiffSharp behavior in this branch is as follows:

let a = dsharp.randn([])
let b = dsharp.randn([10])
let c = dsharp.randn([10;20])
let d = dsharp.randn([10;20;30])
let e = dsharp.randn([10;20;30;40])
let at = a.transpose() // Fails because a.dim < 2
let bt = b.transpose() // Fails because b.dim < 2
let ct = c.transpose() // Gives shape [20; 10]
let dt = d.transpose() // Gives shape [10; 30; 20]
let et = e.transpose() // Gives shape [10; 20; 40; 30]

I believe there is no need for a dedicated "batch-transpose" operation: batch transposition can be achieved once we have the general transpose operation torch.transpose(input, dim0, dim1), which can freely swap any two dimensions of a tensor of any shape. https://pytorch.org/docs/stable/torch.html#torch.transpose
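
For example (a quick PyTorch sketch), swapping the last two dimensions with the general transpose reproduces the batch-transpose behavior above:

import torch

d = torch.randn(10, 20, 30)
dt = torch.transpose(d, 1, 2)  # equivalently d.transpose(-2, -1)
print(dt.shape)                # torch.Size([10, 30, 20])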

I think this batch-transpose behavior was needed mainly for the reverse mode of batch matrix multiplication. We can either:

  • keep this and rename it to something like batchTranspose (and make it internal?) or
  • implement the general transpose(dim0, dim1) (which is currently missing and is a needed operation anyway) and use it for this purpose.

I would go with the second option because it improves the API. In either case we can have the regular transpose (without dim0, dim1) behave the same way as PyTorch's.

@dsyme (Collaborator, Author) commented May 6, 2020

Yup, agreed. There's also torch.permute, which is even more general, right?

https://stackoverflow.com/questions/57512113/how-to-take-a-transpose-for-each-matrix-in-a-batch-in-pytorch
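
For reference, a batch transpose via permute might look like this (sketch):

import torch

e = torch.randn(10, 20, 30, 40)
et = e.permute(0, 1, 3, 2)  # permute takes a full reordering of all dims
print(et.shape)             # torch.Size([10, 20, 40, 30])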

@dsyme (Collaborator, Author) commented May 6, 2020

keep this and rename it to something like batchTranspose (and make it internal?)

This might be the simplest option for now, to get the batch matmul implementation and tests consolidated, and then deal with the general transpose separately?

@gbaydin (Member) commented May 6, 2020

keep this and rename it to something like batchTranspose (and make it internal?)

This might be the simplest option for now, to get the batch matmul implementation and tests consolidated, and then deal with the general transpose separately?

OK, agreed. Let's go with renaming the current transpose to batchTranspose and reintroducing a separate transpose that is identical to PyTorch's t. We can deal with the general transpose(dim0, dim1) separately.

@gbaydin (Member) commented May 6, 2020

@gbaydin Can we relax the code coverage settings? It's really tricky to keep them going up all the time: "63.49% of diff hit (target 63.67%)"

These are the default settings and I agree they are too strict. I looked into relaxing them before but it was surprisingly difficult to find the information. Let me look a bit better and fix it. :)

@dsyme (Collaborator, Author) commented May 6, 2020

These are the default settings and I agree they are too strict. I looked into relaxing them before but it was surprisingly difficult to find the information. Let me look a bit better and fix it. :)

If you like, give me admin rights and I can poke around.

@gbaydin (Member) commented May 6, 2020

These are the default settings and I agree they are too strict. I looked into relaxing them before but it was surprisingly difficult to find the information. Let me look a bit better and fix it. :)

If you like, give me admin rights and I can poke around.

I think you already have it. Codecov uses GitHub permissions (https://codecov.io/gh/DiffSharp/DiffSharp/). Please let me know if it doesn't work.

@gbaydin (Member) commented May 6, 2020

I think we need to set the project/patch targets to custom values using https://github.com/DiffSharp/DiffSharp/blob/dev/codecov.yml. See here: https://docs.codecov.io/docs/commit-status

Edit: I made some changes to the threshold values. It should now allow coverage to fall by up to 10% in a PR before failing. I don't know if it will work as intended; I guess we will see.
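
The change would presumably look something like this in codecov.yml (a sketch based on the Codecov docs; the exact keys and values in the repo may differ):

coverage:
  status:
    project:
      default:
        threshold: 10%  # allow project coverage to drop by up to 10% before failing
    patch:
      default:
        threshold: 10%  # same tolerance for the diff/patch coverage check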

@gbaydin (Member) commented May 6, 2020

Inspecting this further, I have the following concerns about the PR:

  • The behavior of DiffSharp matmul is not the same as torch.matmul, which special-cases by the dimensionality of the arguments: a dot product for the 1d-1d combination, a matrix-vector product for 2d-1d, an ordinary matmul for 2d-2d, and a broadcasting batched matmul (bmm, see below) for higher dimensions.
  • In torch there is a torch.bmm that does batch matmul only for 3d tensors. DiffSharp matmul is not the same as this operation either, because it works with >=2d tensors and supports broadcasting.

I think if we can get matmul to behave identically to torch.matmul, it would be great. Introducing a bmm that works with 3d-3d only is a minor thing and should be straightforward.
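
For reference, a quick PyTorch sketch of those special cases:

import torch

v = torch.randn(10)
m = torch.randn(5, 10)
b = torch.randn(3, 5, 10)

print(torch.matmul(v, v).shape)                  # 1d-1d: dot product, shape []
print(torch.matmul(m, v).shape)                  # 2d-1d: matrix-vector product, shape [5]
print(torch.matmul(m, m.t()).shape)              # 2d-2d: ordinary matmul, shape [5, 5]
print(torch.matmul(b, b.transpose(1, 2)).shape)  # 3d-3d: batched matmul, shape [3, 5, 5]
print(torch.matmul(m, b.transpose(1, 2)).shape)  # 2d-3d: broadcast then batch, shape [3, 5, 5]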

I had other comments about the change of MatMulT2T2 to MatMulTT in the RawTensor API, but I think this is actually a good simplification and we should keep it. An alternative I considered was to have MatMulT2T2 and MatMulT3T3 map to gemm and gemm_batch style operations in CUDA, MKL, etc. (https://devblogs.nvidia.com/cublas-strided-batched-matrix-multiply/). But I can see why MatMulTT is done this way in this PR: it works with the broadcasting expansions. In the backends, we can make the best calls to gemm, gemm_batch, etc. from within MatMulTT.

Another thing we need to think about is whether to introduce MatMulT2T1 or MatMulT1T1 (or DotTT) style RawTensor operations for matrix-vector and dot products. These would correspond to sgemv and sdot at the low level. But handling these through the same MatMulTT method and calling the necessary low-level routine from there should probably be fine. At the Tensor level these correspond to torch.mv and torch.dot.
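
In PyTorch terms, the equivalences would be (sketch):

import torch

v = torch.randn(10)
m = torch.randn(5, 10)

# matmul subsumes both specialized ops, so a single MatMulTT entry point
# could dispatch to sgemv/sdot-style kernels internally
assert torch.allclose(torch.mv(m, v), torch.matmul(m, v))
assert torch.allclose(torch.dot(v, v), torch.matmul(v, v))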

@dsyme (Collaborator, Author) commented May 7, 2020

Inspecting this further, I have the following concerns about the PR:

Cool, makes sense

@dsyme dsyme closed this May 8, 2020
@dsyme dsyme reopened this May 8, 2020
@gbaydin (Member) commented May 13, 2020

Some relevant discussions here: pytorch/pytorch#18027 and here: tensorflow/tensorflow#5523

@dsyme (Collaborator, Author) commented Sep 9, 2020

@gbaydin

I think if we can get matmul to behave identical to torch.matmul, it would be great. Introducing a bmm that works with 3d-3d only is a minor thing and should be straightforward.

I have made and tested this adjustment based on the PyTorch docs for matmul, and this should now be ready.

Another thing that we need to think about is whether to introduce MatMulT2T1 or MatMulT1T1 (or DotTT) ...

I haven't done these parts.

@gbaydin (Member) commented Sep 9, 2020

Thank you, I will have a look and merge soon.

@dsyme (Collaborator, Author) commented Sep 29, 2020

@gbaydin ping :)

@dsyme dsyme mentioned this pull request Oct 20, 2020
@gbaydin gbaydin merged commit 083ac37 into dev Oct 26, 2020
@gbaydin (Member) commented Oct 26, 2020

Finally merged this (with a lot of delay!)

@dsyme (Collaborator, Author) commented Nov 3, 2020

Something went wrong with the overall diff for this in Tensor.fs, which is showing 6000+ line changes (nearly all whitespace).

This merge resulted in a complete rewrite of the file; I'm not sure why: 3c94672

I'll try to undo this and force-push a simpler diff.

@dsyme dsyme mentioned this pull request Nov 3, 2020
dsyme added a commit that referenced this pull request Nov 3, 2020