feature/extension-ops: user-defined extensions #89
Conversation
I updated this with a sample showing how Conv1D would be defined as an extension, rather than being built in, plus tests for this.
src/DiffSharp.Core/Tensor.fs (review comment on an outdated diff):

```diff
@@ -1334,6 +1336,7 @@ and TensorOp =
   | AcosT of Tensor
   | AtanT of Tensor
   | NewT
+  | OpT of (* inputs *) Tensor list * (* gradients *) ((* tangent *) Tensor -> (* input-grad *) Tensor list)
```
Would it perhaps be better to specialize unary and binary operations?
P.S. It's a matter of taste, but I'd prefer DU labels, or even better anonymous records, instead of comments.
Yes, we can add DU labels - or probably a whole new type.
Specialising unary/binary is OK, but we'll need n-ary extensions eventually. Also ones producing multiple results, though I'm guessing we'll always have "same-differentiation level" constraints on those.
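(For illustration only, a hypothetical sketch of the two alternatives being suggested here - labelled DU fields versus an anonymous-record payload. The type names are made up and the sketch assumes the surrounding `Tensor` type from Tensor.fs.)

```fsharp
// Illustrative only: two ways to avoid the (* ... *) comments on the OpT case.

// Alternative 1: named DU fields
type TensorOpLabelled =
    | OpT of inputs: Tensor list * gradients: (Tensor -> Tensor list)

// Alternative 2: an anonymous record as the case payload
type TensorOpAnon =
    | OpT of {| Inputs: Tensor list; Gradients: (Tensor -> Tensor list) |}
```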
Codecov Report
```diff
@@            Coverage Diff             @@
##              dev      #89      +/-   ##
==========================================
+ Coverage   63.49%   63.88%   +0.39%
==========================================
  Files          22       22
  Lines        5862     6421     +559
  Branches     1477     1513      +36
==========================================
+ Hits         3722     4102     +380
- Misses       1359     1518     +159
- Partials      781      801      +20
```
The more I think about it, the more I'm convinced that an extension API will be needed. TBH I think we should already be using one - Tensor.fs is too large. For example, assume we codegen the complete API for TorchSharp, and then someone doing "real" work spots crucial functionality that's now available in TorchSharp but not yet surfaced in DiffSharp. They will need a path to get access to that functionality in their DiffSharp coding without going and adding it to RawTensor, Reference etc. I don't think we should choose a closed-world design that blocks users from adding new operations and gradients.

Realistically, any extension API will require extensions to also specify the implementations of the non-gradient code on the backend relevant to the user. In this prototype I chose examples where the operations on RawTensor were already available. That won't be the case normally, and in general the user will likely cast up to the concrete backend tensor type (e.g. TorchTensor):

```fsharp
type Tensor with
    member a.sinx() =
        Tensor
            { new TorchUnaryOp with
                member _.ComputeRaw(a: TorchTensor) = a.Sin()
                ... }
            a

    member a.cosx() = ...
```

where …
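(`TorchUnaryOp` is a name this comment introduces rather than an existing DiffSharp or TorchSharp type; a minimal hypothetical sketch of the kind of interface it implies might look like the following, where the member names and the `GradRaw` signature are assumptions.)

```fsharp
// Hypothetical sketch only - not an existing DiffSharp/TorchSharp type.
// 'TorchTensor' is TorchSharp's raw tensor handle, as used in the snippet above.
type TorchUnaryOp =
    /// The raw, non-differentiated backend computation (e.g. a.Sin()).
    abstract ComputeRaw : a: TorchTensor -> TorchTensor
    /// The local derivative rule: given the raw input and an incoming
    /// tangent/adjoint value, produce the corresponding input derivative.
    abstract GradRaw : a: TorchTensor * d: TorchTensor -> TorchTensor
```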
This is not surprising, because the second two functions are special cases of the first function! :) They were implemented separately for efficiency (which you are saying will not make a significant difference) and abstracted from the big pattern-matching structure that handles each special case accordingly. Even further unification of the forward and reverse code into something that automatically falls out of a single local derivative definition should also be possible. Given the core nature of these, I think it would be great if you can give me the first shot at this redesign. Then we can iterate. I also need to write lots of nested derivative tests to ensure the behavior doesn't change during a redesign. I agree that a simple extension API will be great.
@gbaydin Yup, exactly.
I think with care we can get the same efficiency. There are three perf issues I spotted.
On (2) one thought is that in the most general n-ary version of the mechanism the reverse should take the index of the reverse component:
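(A hypothetical sketch of what that could look like - the type and member names are invented for illustration, not taken from the PR.)

```fsharp
// Hypothetical n-ary extension interface. Instead of returning the gradients
// for all inputs at once, Reverse is asked for the gradient of one input at a
// time, so gradients of inputs whose adjoints are never needed are never computed.
type NaryOp =
    /// Raw primal computation over the raw inputs.
    abstract Raw : RawTensor[] -> RawTensor
    /// Gradient with respect to input 'i' only, given the inputs and the
    /// adjoint 'td' of the output.
    abstract Reverse : inputs: Tensor[] * td: Tensor * i: int -> Tensor
```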
Alternatively the resulting two parts could be returned lazily.
Yup, I'm aware of the central nature of this change, these are suggestions to help you shape things, mostly by applying software engineering principles while keeping an eye on performance. My recommendation would be to add some kind of extension mechanism like above and gradually start using it internally to factor out chunks of Tensor.fs into separate files.
Yup
Some examples other than …
I think the general idea here is that sometimes the primal computation computes values that may be useful in the derivative. Sometimes these useful values may be the actual return value itself, but more likely they are some intermediates. The example I keep in mind is

```fsharp
let g x = x * (sin x)
let dg x dx = (x * (cos x) + (sin x)) * dx
```

Here the value of the primal is not that helpful for the derivative, so one might instead write

```fsharp
let g x = (let tmp = sin x in (x * tmp, tmp))
(* now user needs to say (fst (g x)) at call site *)
let dg (x, bog) dx = (x * (cos x) + bog) * dx
```

And of course this is not very far from returning a closure, as in "lambda the ultimate":

```fsharp
let g x = (let tmp = sin x in (x * tmp, fun dx -> (x * (cos x) + tmp) * dx))
```

And of further course, it's probably best in this case to write

```fsharp
let g x = (let tmp = sin x in (x * tmp, ())) (* empty bog *)
let dg (x, bog) dx = let s, c = sincos x in (x * c + s) * dx
```

where the bog is empty, as sincos may be as fast as sin, and we avoid holding on to the temporary. Of course a scalar temporary is not a problem, but imagine all of the above operating elementwise on tensors.
Great point and examples! This subject is also connected with the general tradeoff between memory use and compute that applies to other things like reverse mode checkpointing, where one can trade memory usage (values to keep around) for compute (recomputing the same values a few times, when needed, to not keep them around in memory). Ideally these behaviors should be tunable with some library configuration parameters.
@gbaydin I've updated this, you might want to take another look.
Note that I've been working on this and making progress. Will share soon.
Deprecated in favor of #311, which introduced a simpler extension API based on the prototype code here. I will also make use of the docs written here.
I have prototyped a possible future feature where the user can define new primitive unary and binary operations, and specify their gradient functions, in a nice and declarative way "from the outside", fully supporting nested derivatives etc.
This is the logical equivalent of TensorFlow gradient extensions (RegisterGradient etc.) and makes DiffSharp a much more open design while still supporting nested mixed-mode forward/backward derivatives etc.
For example, we define `Sin` and `Cos` purely in terms of raw tensors and the reduction of their (nested) derivatives. Then `t.sinn()` and `t.coss()` are used like ordinary Tensor operations, as if we had defined `Tensor.Sin` and `Tensor.Cos`. Note that extensions nearly always reduce in terms of themselves, hence the recursive-reduction uses in the implementation of the extensions.
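(The code blocks from the original PR description are not preserved in this excerpt. A rough, hypothetical reconstruction of the shape of such a unary extension follows: the `Raw`/`Forward`/`Reverse` member names come from the description below, while `UnaryOp`, `Tensor.Op`, `SinT`/`CosT` and the exact signatures are assumptions.)

```fsharp
// Hypothetical reconstruction - not the PR's actual code.
type Tensor with
    member a.sinn() =
        Tensor.Op
            { new UnaryOp() with
                // primal, computed on raw tensors
                member _.Raw(a) = a.SinT()
                // forward: Jacobian applied to the tangent 'ad' of the input
                member _.Forward(a, ad) = ad * a.coss()
                // reverse: transposed Jacobian applied to the adjoint 'td'
                member _.Reverse(a, td) = td * a.coss() }
            a

    member a.coss() =
        Tensor.Op
            { new UnaryOp() with
                member _.Raw(a) = a.CosT()
                member _.Forward(a, ad) = -ad * a.sinn()
                member _.Reverse(a, td) = -td * a.sinn() }
            a
```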
Here is a binary extension: the `ReverseA` and `ReverseB` members apportion the adjoint `td` to the inputs. And here is a more complex one, for the power function:
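(Again, the original snippet is not preserved here; a hypothetical sketch of the shape of a binary extension for the power function, under the same assumed `Tensor.Op`/`BinaryOp` scheme - the signatures are illustrative only.)

```fsharp
// Hypothetical reconstruction - not the PR's actual code.
type Tensor with
    member a.poww(b: Tensor) =
        Tensor.Op
            { new BinaryOp() with
                // primal on raw tensors (assuming RawTensor exposes PowTT)
                member _.Raw(a, b) = a.PowTT(b)
                // forward when only 'a' has a tangent 'ad': d/da (a**b) = b * a**(b-1)
                member _.ForwardA(a, b, ad) = ad * b * a.poww(b - 1.)
                // forward when only 'b' has a tangent 'bd': d/db (a**b) = a**b * log a
                member _.ForwardB(a, b, bd) = bd * a.poww(b) * a.log()
                // reverse: apportion the adjoint 'td' to each input
                member _.ReverseA(a, b, td) = td * b * a.poww(b - 1.)
                member _.ReverseB(a, b, td) = td * a.poww(b) * a.log() }
            a b
```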
It is optional to implement `ForwardA` and `ForwardB` members, which compute the partial derivative when `bd` and `ad` respectively are zero. The mathematical function that each of `Forward`, `Reverse` etc. must compute is given in the docs for those members: essentially, each is based on the product of the Jacobian of the function with the differential vector (for Forward), and of the transpose of the Jacobian with the adjoint (for Reverse).
p.s. As an aside, we could move a lot of logic out of Tensor.fs via this route - indeed Tensor.fs could largely be reduced to a core associated with Tensor/TensorF/TensorR and nesting interactions, and most of the individual operations could be defined outside. There is no need to extend `TensorOp` as we add new operations - the operations are now characterized by closures.

p^2.s. There is no significant difference in performance with this technique.
p^3.s. One thing to note is that as we add more intensional information (such as the ability to export ONNX graphs based on traces or the like), the extension definitions may need to contain more information.
p^4.s. This could be extended to N->1 operations, though I'd need @gbaydin's deep help with regard to the interactions for nesting etc. It could also be extended to 1->2, 1->N etc., but again I'd need more help.

p^5.s. The extension is ultimately in terms of a RawTensor operation `Raw`. As is somewhat obvious, that in turn may require extending the RawTensor design, or else having RawTensor and its implementations give access to the actual underlying values, handles etc. needed to reach the internal data on CPU and GPU.

p^6.s. If we used this mechanism to define all operations from the outside, the core of DiffSharp would reduce to about 220 lines.
p^7.s. A few reasons to override backprop gradients: