feature/extension-ops: user-defined extensions by dsyme · Pull Request #89 · DiffSharp/DiffSharp

feature/extension-ops: user-defined extensions #89

Closed
dsyme wants to merge 44 commits into dev

Conversation

@dsyme (Collaborator) commented Feb 26, 2020

I have prototyped a possible future feature where the user can define new primitive unary and binary operations, together with their gradient functions, in a nice and declarative way "from the outside", fully supporting nested derivatives etc.

This is the logical equivalent of TensorFlow's gradient extensions (RegisterGradient etc.) and makes DiffSharp a much more open design while still supporting nested mixed-mode forward/backward derivatives etc.

For example we define Sin and Cos purely in terms of raw tensors and the reduction of their (nested) derivatives:

    type Tensor with
        member a.sinn() = 
            Tensor.Op
               { new UnaryOp() with 
                    member _.ComputeRaw(a) = a.SinT()
                    member _.Forward(fab,a,ad) = a.coss() * ad
                    member _.Reverse(a,ad) = a.coss() * ad }
               a

        member a.coss() = 
            Tensor.Op
               { new UnaryOp() with 
                    member _.ComputeRaw(a) = a.CosT()
                    member _.Forward(fa,a,ad) = -a.sinn() * ad 
                    member _.Reverse(a,ad) = -a.sinn() * ad }
                a

Then t.sinn() and t.coss() are used like any other Tensor operation, as if we had defined Tensor.Sin and Tensor.Cos:

            let fwdx = dsharp.tensor(...)
            fwdx.sinn()

etc. Note that extensions nearly always reduce their derivatives in terms of themselves and each other, hence the recursive uses in the implementations of the extensions above.
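For instance, a quick sanity-check sketch (assuming the dsharp.diff forward-mode derivative helper is available): the user-defined sinn differentiates like a built-in sin, and because coss reduces back to sinn, nested derivatives work too.

    let x = dsharp.tensor 1.1
    // first derivative of the user-defined op: expect approximately cos 1.1
    let d1 = dsharp.diff (fun t -> t.sinn()) x
    // nested second derivative: expect approximately -(sin 1.1)
    let d2 = dsharp.diff (fun t -> dsharp.diff (fun u -> u.sinn()) t) x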

Here is a binary extension:

    type Tensor with
        static member mull (a, b) = 
            Tensor.Op
                { new BinaryOp() with 
                    member _.ComputeRaw(a,b) = a.MulTT(b)
                    member _.Forward(fab, a, ad, b, bd) = Tensor.mull(b, ad) + Tensor.mull(a, bd)
                    member _.ReverseA(a, b, td) = Tensor.mull(b, td)
                    member _.ReverseB(a, b, td) = Tensor.mull(a, td)
                }
                (a, b)

The ReverseA and ReverseB members apportion the adjoint td to the respective inputs.
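(For reference, the standard calculus behind this case: for $f(a,b) = a \odot b$ the partials are $\partial f/\partial a = b$ and $\partial f/\partial b = a$, so each reverse member multiplies the incoming adjoint $\bar t$ (td) by the other input: $\bar a = b \odot \bar t$ and $\bar b = a \odot \bar t$.)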

And a more complex one for the power function:

    type Tensor with

        member a.poww(b) = 
            Tensor.Op
                { new BinaryOp() with 
                    member _.ComputeRaw(a,b) = a.PowTT(b)
                    member _.Forward(fa, a, ad, b, bd) = (a ** (b - 1.)) * (b * ad + a * log a * bd)
                    member _.ForwardA(fa, a, ad, b) = (a ** (b - 1.)) * b * ad
                    member _.ForwardB(fa, a, b, bd) = fa * log a * bd
                    member _.ReverseA(a, b, td) = (a ** (b - 1.)) * b * td
                    member _.ReverseB(a, b, td) = (a ** b) * log a * td }
                (a,b)

Implementing the ForwardA and ForwardB members is optional; they compute the partial derivative when bd or ad, respectively, is zero.

The mathematical function that each of Forward, Reverse etc. must compute is given in the docs for those members. Essentially, each is based on the product of the Jacobian of the function with the differential (tangent) vector for Forward, and the product of the transposed Jacobian with the adjoint for Reverse.
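In symbols (standard forward- and reverse-mode AD, stated here for reference, writing $\dot a$ for the input tangent ad and $\bar y$ for the output adjoint td):

$$\text{Forward}: \ \dot y = J_f(a)\,\dot a \qquad\qquad \text{Reverse}: \ \bar a = J_f(a)^{\top}\,\bar y$$

For binary operations, Forward sums the Jacobian-vector products of both inputs, and ReverseA/ReverseB return the two blocks of the transposed-Jacobian product.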

p.s. As an aside, we could move a lot of logic out of Tensor.fs via this route - indeed Tensor.fs could largely be reduced to a core associated with Tensor/TensorF/TensorR and nesting interactions, and most of the individual operations could be defined outside. There is no need to extend TensorOp as we add new operations - the operations are now characterized by closures.

p^2.s. This technique makes no significant difference to performance.

p^3.s. One thing to note is that as we add more intensional information (such as the ability to export ONNX graphs based on traces or the like), the extension definitions may need to contain more information.

p^4.s. This could be extended to N->1 operations, though I'd need @gbaydin's deep help with regard to the interactions for nesting etc. It could also be extended to 1->2, 1->N etc., but again I'd need more help.

p^5.s. The extension is ultimately expressed in terms of a RawTensor operation (ComputeRaw above). As is somewhat obvious, that in turn may require extending the RawTensor design, or else having RawTensor and its implementations expose the actual underlying values, handles etc. so that the internal data on CPU and GPU can be reached.

p^6.s. If we used this mechanism to define all operations from the outside, the core of DiffSharp would reduce to about 220 lines.

p^7.s. A few reasons to override backprop gradients:

  • Faster execution with approximate methods (e.g. normal distribution)
  • Training of quantized operations, since you can't differentiate discrete values (see the sketch after this list)
  • A derivative doesn't exist but you can fake an approximation
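For instance, the quantized/non-differentiable cases could be handled with a straight-through style gradient via the proposed UnaryOp API. A rough sketch (not part of this PR; it assumes RawTensor exposes a RoundT method - if it doesn't, that is exactly the kind of backend operation the extension would also need to supply):

    type Tensor with
        member a.roundST() =
            Tensor.Op
               { new UnaryOp() with
                    member _.ComputeRaw(a) = a.RoundT()      // non-differentiable primal (assumed RawTensor method)
                    member _.Forward(fa, a, ad) = ad         // pretend the local derivative is the identity
                    member _.Reverse(a, ad) = ad }           // pass the adjoint straight through
               a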

@dsyme (Collaborator, Author) commented Feb 27, 2020

I updated this with a sample showing how Conv1D would be defined as an extension, rather than being built in, plus tests for this.

@@ -1334,6 +1336,7 @@ and TensorOp =
| AcosT of Tensor
| AtanT of Tensor
| NewT
| OpT of (* inputs *) Tensor list * (* gradients *) ((* tangent *) Tensor -> (* input-grad *) Tensor list)
A reviewer commented on this diff:

Would it perhaps be better to specialize unary and binary operations?

P.S. It's a matter of taste, but I'd prefer DU labels or even better anonymous records instead of comments.

@dsyme (Collaborator, Author) replied:

Yes, we can add DU labels - or probably a whole new type.

Specialising unary/binary is OK, but we'll need n-ary extensions eventually. Also ones producing multiple results, though I'm guessing we'll always have "same-differentiation level" constraints on those.
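For illustration, a standalone sketch (with a placeholder Tensor so it compiles on its own; names are hypothetical) of the two shapes mentioned - labelled DU fields versus an anonymous-record payload:

    // Placeholder so the sketch is self-contained; the real Tensor lives in Tensor.fs.
    type Tensor = T of float

    type TensorOpSketch =
        // labelled fields in the union case
        | OpTLabelled of inputs: Tensor list * gradients: (Tensor -> Tensor list)
        // or an anonymous-record payload
        | OpTRecord of {| Inputs: Tensor list; Gradients: (Tensor -> Tensor list) |}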

@codecov (codecov bot) commented May 5, 2020

Codecov Report

Merging #89 (e66f417) into dev (c2f87b4) will increase coverage by 0.39%.
The diff coverage is 68.18%.

Impacted file tree graph

@@            Coverage Diff             @@
##              dev      #89      +/-   ##
==========================================
+ Coverage   63.49%   63.88%   +0.39%     
==========================================
  Files          22       22              
  Lines        5862     6421     +559     
  Branches     1477     1513      +36     
==========================================
+ Hits         3722     4102     +380     
- Misses       1359     1518     +159     
- Partials      781      801      +20     
Impacted Files                                      Coverage Δ
src/DiffSharp.Core/Tensor.fs                        62.24% <68.18%> (-9.77%) ⬇️
src/DiffSharp.Core/DiffSharp.Numerical.fs           77.63% <0.00%> (-3.95%) ⬇️
src/DiffSharp.Core/Util.fs                          54.40% <0.00%> (-3.61%) ⬇️
src/DiffSharp.Data/Image.fs                         76.92% <0.00%> (-0.85%) ⬇️
src/DiffSharp.Data/Data.fs                          61.18% <0.00%> (-0.50%) ⬇️
src/DiffSharp.Core/Scalar.fs                         8.33% <0.00%> (-0.49%) ⬇️
src/DiffSharp.Core/RawTensor.fs                     84.97% <0.00%> (-0.37%) ⬇️
src/DiffSharp.Data/Plot.fs                           0.00% <0.00%> (ø)
src/DiffSharp.Backends.Torch/Torch.RawTensor.fs     79.26% <0.00%> (+0.34%) ⬆️
... and 12 more

@dsyme (Collaborator, Author) commented May 23, 2020

The more I think about it, the more I'm convinced that an extension API will be needed. TBH I think we should already be using one - Tensor.fs is too large.

For example, assume we codegen the complete API for TorchSharp, and then someone doing "real" work spots crucial functionality that's now available in TorchSharp but not yet surfaced in DiffSharp. They will need a path to get access to that functionality in their DiffSharp coding without going and adding it to RawTensor, Reference etc. I don't think we should choose a closed-world design that blocks users from adding new operations and gradients.

Realistically, any extension API will require extensions to also specify the implementations of non-gradient code on the backend relevant to the user, i.e. the RawTensor -> RawTensor code in this prototype.

In this prototype I chose examples where the operations on RawTensor were already available. That won't normally be the case, and in general the user will likely cast up to TorchRawTensor, grab the TorchTensor, call the operation and pack a new TorchRawTensor. The Torch backend could provide that, giving something like this:

    type Tensor with
        member a.sinx() = 
            Tensor
               { new TorchUnaryOp with 
                    member _.ComputeRaw(a: TorchTensor) = a.Sin()
                    ... }
               a
      
        member a.cosx() = ...

where the compute member in TorchUnaryOp requires a direct implementation against TorchTensor. The extension is specific to Torch, but that's OK and realistic.
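A very rough sketch of what such an adapter could look like (none of this is in the PR; UnaryOp is the prototype's base type, and the unwrap/wrap parameters stand in for whatever accessor and factory the Torch backend would actually expose for the underlying TorchSharp handle):

    // Sketch only - the unwrap/wrap functions are hypothetical placeholders.
    [<AbstractClass>]
    type TorchUnaryOp(unwrap: RawTensor -> TorchTensor, wrap: TorchTensor -> RawTensor) =
        inherit UnaryOp()
        /// The backend-specific primal computation, written directly against TorchSharp.
        abstract ComputeTorch: TorchTensor -> TorchTensor
        override this.ComputeRaw(a) = wrap (this.ComputeTorch (unwrap a))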

@gbaydin (Member) commented May 24, 2020

> the second two functions can be implemented in terms of the first

This is not surprising because the second two functions are special cases of the first function! :) They were implemented separately for efficiency (which you are saying will not make a significant difference) and abstracted from the big pattern-matching structure that handles each special case accordingly. Even further unification of forward and reverse code, into something that automatically falls out of a single piece of local derivative code, should also be possible. Given the core nature of these, I think it would be great if you can give me the first shot at this redesign. Then we can iterate. I also need to write lots of nested derivative tests to ensure the behavior doesn't change during a redesign. I agree that a simple extension API will be great.

@dsyme (Collaborator, Author) commented May 24, 2020

@gbaydin Yup, exactly.

> which you are saying will not make a significant difference

I think we can get the same efficiency, but we have to be careful.

There are three perf issues I spotted:

  1. the use of "zero" values in forward

  2. the possibility that we unnecessarily compute all reverse components in ReverseTC and ReverseCT, when only one is needed

  3. the question of whether we provide f(x) primal (called cp in the code) when computing forward df - this is currently provided in the forward phases but not AFAICS in the reverse phases. This matters for exp and other cases where f(x) features exactly in the gradient.

On (2) one thought is that in the most general n-ary version of the mechanism the reverse should take the index of the reverse component:

let powReverse i (dt, a, b) =
    match i with
    | 0 -> dt * (a ** (b - 1.)) * b   // adjoint contribution w.r.t. a
    | 1 -> dt * (a ** b) * log a      // adjoint contribution w.r.t. b
    | _ -> failwith "no such input"

Alternatively the resulting two parts could be returned lazily.
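A sketch of that lazy alternative (names hypothetical): both adjoint contributions are returned, but each is only computed if forced.

    let powReverseLazy (dt, a, b) =
        [ lazy (dt * (a ** (b - 1.)) * b)   // component 0: adjoint w.r.t. a
          lazy (dt * (a ** b) * log a) ]    // component 1: adjoint w.r.t. b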

> Given the core nature of these, I think it would be great if you can give me the first shot at this redesign. Then we can iterate.

Yup, I'm aware of the central nature of this change. These are suggestions to help you shape things, mostly by applying software engineering principles while keeping an eye on performance. My recommendation would be to add some kind of extension mechanism like the above and gradually start using it internally to factor chunks of Tensor.fs out into separate files.

> I also need to write lots of nested derivative tests to ensure the behavior doesn't change during a redesign.

Yup

@gbaydin (Member) commented May 24, 2020

> the question of whether we provide f(x) primal (called cp in the code) when computing forward df - this is currently provided in the forward phases but not AFAICS in the reverse phases. This matters for exp and other cases where f(x) features exactly in the gradient.

Some examples other than exp where this avoids recomputation are softplus, sigmoid, sqrt. Note that all four examples are very commonly used in machine learning. I think we would add a few more operations of this nature in the future. This is also exploited in the reverse code currently.
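To make the exp case concrete in the shape of this PR's UnaryOp prototype (a sketch, assuming a RawTensor ExpT method): forward can reuse the primal fa directly, while reverse has no primal to hand and must recompute it - exactly point (3) above.

    type Tensor with
        member a.expp() =
            Tensor.Op
               { new UnaryOp() with
                    member _.ComputeRaw(a) = a.ExpT()          // assumed RawTensor method
                    member _.Forward(fa, a, ad) = fa * ad      // fa = exp a, reused for free
                    member _.Reverse(a, ad) = a.expp() * ad }  // has to recompute exp a
               a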

@awf commented Jun 5, 2020

> the question of whether we provide f(x) primal (called cp in the code) when computing forward df - this is currently provided in the forward phases but not AFAICS in the reverse phases. This matters for exp and other cases where f(x) features exactly in the gradient.

> Some examples other than exp where this avoids recomputation are softplus, sigmoid, sqrt. Note that all four examples are very commonly used in machine learning. I think we would add a few more operations of this nature in the future. This is also exploited in the reverse code currently.

I think the general idea here is that sometimes the primal computation computes values that may be useful in the derivative. Sometimes these useful values may be the actual return value itself, but more likely they are some intermediates.

The example I keep in mind is

let g x = x * (sin x)
let dg x dx = (x * (cos x) + (sin x)) * dx

Here the value of the primal is not that helpful for dg, but there are useful values in the process of its computation. An alternative is for all primals to return an extra bag of goodies (bog) which can be passed to the derivative computations:

let g x = (let tmp = sin x in (x * tmp, tmp))
(* now user needs to say (fst (g x)) at call site *)
let dg (x, bog) dx = (x * (cos x) + bog) * dx

And of course this is not very far from returning a closure as in lambda the ultimate:

let g x = (let tmp = sin x in (x * tmp, fun dx -> (x * (cos x) + tmp) * dx))

And of further course, it's probably best in this case to write

let g x = (let tmp = sin x in (x * tmp, ())) (* empty bog *)
let dg (x, bog) dx = let s,c = sincos x in  (x * c + s) * dx

Here the bog is empty, as sincos may be as fast as sin, and we avoid holding on to the temporary. Of course a scalar temporary is not a problem, but imagine all of the above operating elementwise on a tensor x.

@gbaydin (Member) commented Jun 5, 2020

> I think the general idea here is that sometimes the primal computation computes values that may be useful in the derivative. Sometimes these useful values may be the actual return value itself, but more likely they are some intermediates.

Great point and examples! This subject is also connected with the general tradeoff between memory use and compute that applies to other things like reverse-mode checkpointing, where one can trade memory usage (values to keep around) for compute (recomputing the same values a few times, when needed, so as not to keep them around in memory).

Ideally these behaviors should be tunable with some library configuration parameters.

@dsyme (Collaborator, Author) commented Nov 26, 2020

@gbaydin I've updated this, you might want to take another look

@dsyme dsyme mentioned this pull request Nov 28, 2020
@dsyme dsyme changed the title Future Feature: user-defined extensions feature/extension-ops: user-defined extensions Nov 28, 2020
@dsyme (Collaborator, Author) commented Mar 27, 2021

@gbaydin I'd appreciate progress on getting this and #260 merged.

@gbaydin (Member) commented Mar 30, 2021

Note that I've been working on this and making progress. Will share soon.

@gbaydin gbaydin mentioned this pull request Mar 31, 2021
@gbaydin (Member) commented Apr 2, 2021

Deprecated in favor of #311, which introduced a simpler extension API based on the prototype code here. I will also make use of the docs written here.

@gbaydin gbaydin closed this Apr 2, 2021