feature/extension-ops: user-defined extensions by dsyme · Pull Request #89 · DiffSharp/DiffSharp

feature/extension-ops: user-defined extensions #89

Closed
dsyme wants to merge 44 commits into dev

Conversation

@dsyme (Collaborator) commented Feb 26, 2020

I have prototyped a possible future feature where the user can define new primitive unary and binary operations, together with their gradient functions, in a nice and declarative way "from the outside", fully supporting nested derivatives etc.

This is the logical equivalent of TensorFlow's gradient extensions (RegisterGradient etc.) and makes DiffSharp a much more open design while still supporting nested mixed-mode forward/backward derivatives etc.

For example we define Sin and Cos purely in terms of raw tensors and the reduction of their (nested) derivatives:

    type Tensor with
        member a.sinn() = 
            Tensor.Op
               { new UnaryOp() with 
                    member _.ComputeRaw(a) = a.SinT()
                    member _.Forward(fab,a,ad) = a.coss() * ad
                    member _.Reverse(a,ad) = a.coss() * ad }
               a

        member a.coss() = 
            Tensor.Op
               { new UnaryOp() with 
                    member _.ComputeRaw(a) = a.CosT()
                    member _.Forward(fa,a,ad) = -a.sinn() * ad 
                    member _.Reverse(a,ad) = -a.sinn() * ad }
                a

Then t.sinn() and t.coss() are used like any other Tensor operation, as if we had defined Tensor.Sin and Tensor.Cos:

            let fwdx = dsharp.tensor(...)
            fwdx.sinn()

etc. Note that extensions nearly always reduce their derivatives in terms of themselves and each other, hence the recursive uses in the implementations of the extensions above.
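For instance, a quick sanity-check sketch (assuming the dsharp.diff forward-mode derivative helper is available): the user-defined sinn differentiates like a built-in sin, and because coss reduces back to sinn, nested derivatives work too.

    let x = dsharp.tensor 1.1
    // first derivative of the user-defined op: expect approximately cos 1.1
    let d1 = dsharp.diff (fun t -> t.sinn()) x
    // nested second derivative: expect approximately -(sin 1.1)
    let d2 = dsharp.diff (fun t -> dsharp.diff (fun u -> u.sinn()) t) x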

Here is a binary extension:

    type Tensor with
        static member mull (a, b) = 
            Tensor.Op
                { new BinaryOp() with 
                    member _.ComputeRaw(a,b) = a.MulTT(b)
                    member _.Forward(fab, a, ad, b, bd) = Tensor.mull(b, ad) + Tensor.mull(a, bd)
                    member _.ReverseA(a, b, td) = Tensor.mull(b, td)
                    member _.ReverseB(a, b, td) = Tensor.mull(a, td)
                }
                (a, b)

The ReverseA and ReverseB members apportion the adjoint td to the respective inputs.
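(For reference, the standard calculus behind this case: for $f(a,b) = a \odot b$ the partials are $\partial f/\partial a = b$ and $\partial f/\partial b = a$, so each reverse member multiplies the incoming adjoint $\bar t$ (td) by the other input: $\bar a = b \odot \bar t$ and $\bar b = a \odot \bar t$.)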

And a more complex one for the power function:

    type Tensor with

        member a.poww(b) = 
            Tensor.Op
                { new BinaryOp() with 
                    member _.ComputeRaw(a,b) = a.PowTT(b)
                    member _.Forward(fa, a, ad, b, bd) = (a ** (b - 1.)) * (b * ad + a * log a * bd)
                    member _.ForwardA(fa, a, ad, b) = (a ** (b - 1.)) * b * ad
                    member _.ForwardB(fa, a, b, bd) = fa * log a * bd
                    member _.ReverseA(a, b, td) = (a ** (b - 1.)) * b * td
                    member _.ReverseB(a, b, td) = (a ** b) * log a * td }
                (a,b)

Implementing the ForwardA and ForwardB members is optional; they compute the partial derivative when bd or ad, respectively, is zero.

The mathematical function that each of Forward, Reverse etc. must compute is given in the docs for those members. Essentially, each is based on the product of the Jacobian of the function with the differential (tangent) vector for Forward, and the product of the transposed Jacobian with the adjoint for Reverse.
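In symbols (standard forward- and reverse-mode AD, stated here for reference, writing $\dot a$ for the input tangent ad and $\bar y$ for the output adjoint td):

$$\text{Forward}: \ \dot y = J_f(a)\,\dot a \qquad\qquad \text{Reverse}: \ \bar a = J_f(a)^{\top}\,\bar y$$

For binary operations, Forward sums the Jacobian-vector products of both inputs, and ReverseA/ReverseB return the two blocks of the transposed-Jacobian product.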

p.s. As an aside, we could move a lot of logic out of Tensor.fs via this route - indeed Tensor.fs could largely be reduced to a core associated with Tensor/TensorF/TensorR and nesting interactions, and most of the individual operations could be defined outside. There is no need to extend TensorOp as we add new operations - the operations are now characterized by closures.

p^2.s. This technique makes no significant difference to performance.

p^3.s. One thing to note is that as we add more intensional information (such as the ability to export ONNX graphs based on traces or the like), the extension definitions may need to contain more information.

p^4.s. This could be extended to N->1 operations, though I'd need @gbaydin's deep help with regard to the interactions for nesting etc. It could also be extended to 1->2, 1->N etc., but again I'd need more help.

p^5.s. The extension is ultimately expressed in terms of a RawTensor operation (ComputeRaw above). As is somewhat obvious, that in turn may require extending the RawTensor design, or else having RawTensor and its implementations expose the actual underlying values, handles etc. so that the internal data on CPU and GPU can be reached.

p^6.s. If we used this mechanism to define all operations from the outside, the core of DiffSharp would reduce to about 220 lines.

p^7.s. A few reasons to override backprop gradients:

  • Faster execution with approximate methods (e.g. normal distribution)
  • Training of quantized operations, since you can't differentiate discrete values (see the sketch after this list)
  • A derivative doesn't exist but you can fake an approximation
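For instance, the quantized/non-differentiable cases could be handled with a straight-through style gradient via the proposed UnaryOp API. A rough sketch (not part of this PR; it assumes RawTensor exposes a RoundT method - if it doesn't, that is exactly the kind of backend operation the extension would also need to supply):

    type Tensor with
        member a.roundST() =
            Tensor.Op
               { new UnaryOp() with
                    member _.ComputeRaw(a) = a.RoundT()      // non-differentiable primal (assumed RawTensor method)
                    member _.Forward(fa, a, ad) = ad         // pretend the local derivative is the identity
                    member _.Reverse(a, ad) = ad }           // pass the adjoint straight through
               a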

@dsyme (Collaborator, Author) commented Feb 27, 2020

I updated this with a sample showing how Conv1D would be defined as an extension, rather than being built in, plus tests for this.

@@ -1334,6 +1336,7 @@ and TensorOp =
| AcosT of Tensor
| AtanT of Tensor
| NewT
| OpT of (* inputs *) Tensor list * (* gradients *) ((* tangent *) Tensor -> (* input-grad *) Tensor list)
A reviewer commented on this diff:

Would it perhaps be better to specialize unary and binary operations?

P.S. It's a matter of taste, but I'd prefer DU labels or even better anonymous records instead of comments.

@dsyme (Collaborator, Author) replied:

Yes, we can add DU labels - or probably a whole new type.

Specialising unary/binary is OK, but we'll need n-ary extensions eventually. Also ones producing multiple results, though I'm guessing we'll always have "same-differentiation level" constraints on those.
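For illustration, a standalone sketch (with a placeholder Tensor so it compiles on its own; names are hypothetical) of the two shapes mentioned - labelled DU fields versus an anonymous-record payload:

    // Placeholder so the sketch is self-contained; the real Tensor lives in Tensor.fs.
    type Tensor = T of float

    type TensorOpSketch =
        // labelled fields in the union case
        | OpTLabelled of inputs: Tensor list * gradients: (Tensor -> Tensor list)
        // or an anonymous-record payload
        | OpTRecord of {| Inputs: Tensor list; Gradients: (Tensor -> Tensor list) |}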

@codecov (codecov bot) commented May 5, 2020

Codecov Report

Merging #89 (e66f417) into dev (c2f87b4) will increase coverage by 0.39%.
The diff coverage is 68.18%.

Impacted file tree graph

@@            Coverage Diff             @@
##              dev      #89      +/-   ##
==========================================
+ Coverage   63.49%   63.88%   +0.39%     
==========================================
  Files          22       22              
  Lines        5862     6421     +559     
  Branches     1477     1513      +36     
==========================================
+ Hits         3722     4102     +380     
- Misses       1359     1518     +159     
- Partials      781      801      +20     
Impacted Files                                      Coverage Δ
src/DiffSharp.Core/Tensor.fs                        62.24% <68.18%> (-9.77%) ⬇️
src/DiffSharp.Core/DiffSharp.Numerical.fs           77.63% <0.00%> (-3.95%) ⬇️
src/DiffSharp.Core/Util.fs                          54.40% <0.00%> (-3.61%) ⬇️
src/DiffSharp.Data/Image.fs                         76.92% <0.00%> (-0.85%) ⬇️
src/DiffSharp.Data/Data.fs                          61.18% <0.00%> (-0.50%) ⬇️
src/DiffSharp.Core/Scalar.fs                         8.33% <0.00%> (-0.49%) ⬇️
src/DiffSharp.Core/RawTensor.fs                     84.97% <0.00%> (-0.37%) ⬇️
src/DiffSharp.Data/Plot.fs                           0.00% <0.00%> (ø)
src/DiffSharp.Backends.Torch/Torch.RawTensor.fs     79.26% <0.00%> (+0.34%) ⬆️
... and 12 more

@dsyme (Collaborator, Author) commented May 23, 2020

The more I think about it, the more I'm convinced that an extension API will be needed. TBH I think we should already be using one - Tensor.fs is too large.

For example, assume we codegen the complete API for TorchSharp, and then someone doing "real" work spots crucial functionality that's now available in TorchSharp but not yet surfaced in DiffSharp. They will need a path to get access to that functionality in their DiffSharp coding without going and adding it to RawTensor, Reference etc. I don't think we should choose a closed-world design that blocks users from adding new operations and gradients.

Realistically, any extension API will require extensions to also specify the implementations of non-gradient code on the backend relevant to the user, i.e. the RawTensor -> RawTensor code in this prototype.

In this prototype I chose examples where the operations on RawTensor were already available. That won't normally be the case, and in general the user will likely cast up to TorchRawTensor, grab the TorchTensor, call the operation and pack a new TorchRawTensor. The Torch backend could provide that, giving something like this:

    type Tensor with
        member a.sinx() = 
            Tensor
               { new TorchUnaryOp with 
                    member _.ComputeRaw(a: TorchTensor) = a.Sin()
                    ... }
               a
      
        member a.cosx() = ...

where the compute member in TorchUnaryOp requires a direct implementation against TorchTensor. The extension is specific to Torch, but that's OK and realistic.
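A very rough sketch of what such an adapter could look like (none of this is in the PR; UnaryOp is the prototype's base type, and the unwrap/wrap parameters stand in for whatever accessor and factory the Torch backend would actually expose for the underlying TorchSharp handle):

    // Sketch only - the unwrap/wrap functions are hypothetical placeholders.
    [<AbstractClass>]
    type TorchUnaryOp(unwrap: RawTensor -> TorchTensor, wrap: TorchTensor -> RawTensor) =
        inherit UnaryOp()
        /// The backend-specific primal computation, written directly against TorchSharp.
        abstract ComputeTorch: TorchTensor -> TorchTensor
        override this.ComputeRaw(a) = wrap (this.ComputeTorch (unwrap a))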

@gbaydin (Member) commented May 24, 2020

> the second two functions can be implemented in terms of the first

This is not surprising because the second two functions are special cases of the first function! :) They were implemented separately for efficiency (which you are saying will not make a significant difference) and abstracted from the big pattern-matching structure that handles each special case accordingly. Even further unification of forward and reverse code, into something that automatically falls out of a single piece of local derivative code, should also be possible. Given the core nature of these, I think it would be great if you can give me the first shot at this redesign. Then we can iterate. I also need to write lots of nested derivative tests to ensure the behavior doesn't change during a redesign. I agree that a simple extension API will be great.

@dsyme (Collaborator, Author) commented May 24, 2020

@gbaydin Yup, exactly.

> which you are saying will not make a significant difference

I think we can get the same efficiency, but we have to be careful.

There are three perf issues I spotted:

  1. the use of "zero" values in forward

  2. the possibility that we unnecessarily compute all reverse components in ReverseTC and ReverseCT, when only one is needed

  3. the question of whether we provide f(x) primal (called cp in the code) when computing forward df - this is currently provided in the forward phases but not AFAICS in the reverse phases. This matters for exp and other cases where f(x) features exactly in the gradient.

On (2) one thought is that in the most general n-ary version of the mechanism the reverse should take the index of the reverse component:

let powReverse i (dt, a, b) =
    match i with
    | 0 -> dt * (a ** (b - 1.)) * b   // adjoint contribution w.r.t. a
    | 1 -> dt * (a ** b) * log a      // adjoint contribution w.r.t. b
    | _ -> failwith "no such input"

Alternatively the resulting two parts could be returned lazily.
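A sketch of that lazy alternative (names hypothetical): both adjoint contributions are returned, but each is only computed if forced.

    let powReverseLazy (dt, a, b) =
        [ lazy (dt * (a ** (b - 1.)) * b)   // component 0: adjoint w.r.t. a
          lazy (dt * (a ** b) * log a) ]    // component 1: adjoint w.r.t. b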

> Given the core nature of these, I think it would be great if you can give me the first shot at this redesign. Then we can iterate.

Yup, I'm aware of the central nature of this change. These are suggestions to help you shape things, mostly by applying software engineering principles while keeping an eye on performance. My recommendation would be to add some kind of extension mechanism like the above and gradually start using it internally to factor chunks of Tensor.fs out into separate files.

> I also need to write lots of nested derivative tests to ensure the behavior doesn't change during a redesign.

Yup

@gbaydin (Member) commented May 24, 2020

> the question of whether we provide f(x) primal (called cp in the code) when computing forward df - this is currently provided in the forward phases but not AFAICS in the reverse phases. This matters for exp and other cases where f(x) features exactly in the gradient.

Some examples other than exp where this avoids recomputation are softplus, sigmoid, sqrt. Note that all four examples are very commonly used in machine learning. I think we would add a few more operations of this nature in the future. This is also exploited in the reverse code currently.
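To make the exp case concrete in the shape of this PR's UnaryOp prototype (a sketch, assuming a RawTensor ExpT method): forward can reuse the primal fa directly, while reverse has no primal to hand and must recompute it - exactly point (3) above.

    type Tensor with
        member a.expp() =
            Tensor.Op
               { new UnaryOp() with
                    member _.ComputeRaw(a) = a.ExpT()          // assumed RawTensor method
                    member _.Forward(fa, a, ad) = fa * ad      // fa = exp a, reused for free
                    member _.Reverse(a, ad) = a.expp() * ad }  // has to recompute exp a
               a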

@awf commented Jun 5, 2020

> the question of whether we provide f(x) primal (called cp in the code) when computing forward df - this is currently provided in the forward phases but not AFAICS in the reverse phases. This matters for exp and other cases where f(x) features exactly in the gradient.

> Some examples other than exp where this avoids recomputation are softplus, sigmoid, sqrt. Note that all four examples are very commonly used in machine learning. I think we would add a few more operations of this nature in the future. This is also exploited in the reverse code currently.

I think the general idea here is that sometimes the primal computation computes values that may be useful in the derivative. Sometimes these useful values may be the actual return value itself, but more likely they are some intermediates.

The example I keep in mind is

let g x = x * (sin x)
let dg x dx = (x * (cos x) + (sin x)) * dx

Here the value of the primal is not that helpful for dg, but there are useful values in the process of its computation. An alternative is for all primals to return an extra bag of goodies (bog) which can be passed to the derivative computations:

let g x = (let tmp = sin x in (x * tmp, tmp))
(* now user needs to say (fst (g x)) at call site *)
let dg (x, bog) dx = (x * (cos x) + bog) * dx

And of course this is not very far from returning a closure as in lambda the ultimate:

let g x = (let tmp = sin x in (x * tmp, fun dx -> (x * (cos x) + tmp) * dx))

And of further course, it's probably best in this case to write

let g x = (let tmp = sin x in (x * tmp, ())) (* empty bog *)
let dg (x, bog) dx = let s,c = sincos x in  (x * c + s) * dx

Here the bog is empty, as sincos may be as fast as sin, and we avoid holding on to the temporary. Of course a scalar temporary is not a problem, but imagine all of the above operating elementwise on a tensor x.

@gbaydin (Member) commented Jun 5, 2020

> I think the general idea here is that sometimes the primal computation computes values that may be useful in the derivative. Sometimes these useful values may be the actual return value itself, but more likely they are some intermediates.

Great point and examples! This subject is also connected with the general tradeoff between memory use and compute that applies to other things like reverse-mode checkpointing, where one can trade memory usage (values to keep around) for compute (recomputing the same values a few times, when needed, so as not to keep them around in memory).

Ideally these behaviors should be tunable with some library configuration parameters.

@dsyme (Collaborator, Author) commented Nov 26, 2020

@gbaydin I've updated this, you might want to take another look

@dsyme dsyme mentioned this pull request Nov 28, 2020
@dsyme dsyme changed the title Future Feature: user-defined extensions feature/extension-ops: user-defined extensions Nov 28, 2020
@dsyme (Collaborator, Author) commented Mar 27, 2021

@gbaydin I'd appreciate progress on getting this and #260 merged.

@gbaydin (Member) commented Mar 30, 2021

Note that I've been working on this and making progress. Will share soon.

@gbaydin gbaydin mentioned this pull request Mar 31, 2021
@gbaydin (Member) commented Apr 2, 2021

Deprecated in favor of #311, which introduced a simpler extension API based on the prototype code here. I will also make use of the docs written here.

@gbaydin gbaydin closed this Apr 2, 2021