dockerd gpu support #37504
Conversation
Please sign your commits following these rules:

$ git clone -b "dockerd-gpu-support" git@github.com:flx42/moby.git somewhere
$ cd somewhere
$ git rebase -i HEAD~842353798760
# editor opens
# change each 'pick' to 'edit'
# save the file and quit
$ git commit --amend -s --no-edit
$ git rebase --continue   # and repeat the amend for each commit
$ git push -f

Amending updates the existing PR. You DO NOT need to open a new one.
force-pushed 724547a to 70214b0
force-pushed 70214b0 to eade168
force-pushed eade168 to 1c66597
libcontainerd/client_daemon.go
Outdated
Just call this code in the handler. You can call the spec opts like any other function, and you can just make a simple combinator for this.
In your gpu handler, don't use annotations; just process the fields like:
// a simple combinator: apply each containerd SpecOpts to the spec in order
func applyOpts(ctx context.Context, s *specs.Spec, opts ...oci.SpecOpts) error {
	for _, o := range opts {
		if err := o(ctx, nil, nil, s); err != nil {
			return err
		}
	}
	return nil
}
or you just build your gpu opts and call nvidia.WithGPUs(nvidiaOpts)(ctx, nil, nil, spec)
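For concreteness, a rough sketch of that direct-call approach using containerd's contrib/nvidia package; the applyGPUs wrapper, the devices parameter, and where this would live are assumptions made up for illustration, not this PR's code.

package daemon // illustrative location only

import (
	"context"

	"github.com/containerd/containerd/contrib/nvidia"
	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// applyGPUs wires the requested device indexes into the OCI spec by invoking
// the containerd SpecOpts directly, like calling any other function.
func applyGPUs(ctx context.Context, spec *specs.Spec, devices []int) error {
	if len(devices) == 0 {
		return nil
	}
	return nvidia.WithGPUs(nvidia.WithDevices(devices...))(ctx, nil, nil, spec)
}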
force-pushed c4a42d3 to b6f5ffe
Codecov Report
@@            Coverage Diff            @@
##             master   #37504   +/-   ##
=========================================
  Coverage          ?    35.6%
=========================================
  Files             ?      611
  Lines             ?    44971
  Branches          ?        0
=========================================
  Hits              ?    16014
  Misses            ?    26744
  Partials          ?     2213
This new feature is now behind the experimental flag.
Removed the WIP tag from the title.
daemon/gpu/gpu.go
Outdated
Don't think you need to do this in a func init()
Need to double-check the order of initialization. I did the same as:
https://github.com/moby/moby/blob/master/daemon/graphdriver/driver.go#L148-L150
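For reference, a minimal sketch of that init-time registration pattern (the same register-on-import approach the graphdriver code uses); gpu.Register and its signature are assumptions made up for illustration, not this PR's actual API.

package nvidia

import (
	"github.com/docker/docker/daemon/gpu" // hypothetical registry package
	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// Registering in init() makes the plugin available as a side effect of
// importing the package, which is why the order of initialization matters.
func init() {
	gpu.Register("nvidia", configure)
}

// configure would edit the OCI spec for the requested GPUs (stubbed here).
func configure(s *specs.Spec, gpus []string) error {
	return nil
}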
daemon/oci_linux.go
Outdated
Please use an error from the errdefs package.
daemon/oci_linux.go
Outdated
Seems like the experimental check should be higher up in the stack, and greatly simplify this function.
Ok, and do you prefer passing a boolean to setResources instead of the daemon object?
No, I would just not call setGPUResources if the daemon is not experimental.
Sure, but I'm talking about the function one step up: setResources; that function needs to be called in all cases.
Ah... a little gross, but maybe we can do setResources and setResourcesExperimental?
seems reasonable.
libcontainerd/client_daemon.go
Outdated
🤔
As it says on the tin, sorry for the revendoring noise :)
daemon/gpu/nvidia/nvidia.go
Outdated
Can you explain this?
The default behavior from Michael's nvidia plugin is to use containerd as the binary. The hook is executed as:
containerd oci-hook -- nvidia-container-cli ... [args] ...
Maybe it should have been a separate binary, but that's the status right now.
So in this case we need docker-containerd, not containerd.
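For illustration, a hedged sketch of what installing such a prestart hook into the OCI spec could look like; the binary path and arguments below are assumptions based on the invocation quoted above, not necessarily what this PR does.

package nvidia

import (
	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// withNvidiaHook appends a prestart hook that shells out to
// "<binary> oci-hook -- nvidia-container-cli ..." as discussed above.
func withNvidiaHook(s *specs.Spec, binary string) {
	if s.Hooks == nil {
		s.Hooks = &specs.Hooks{}
	}
	s.Hooks.Prestart = append(s.Hooks.Prestart, specs.Hook{
		Path: binary, // e.g. the resolved path to docker-containerd (assumption)
		Args: []string{binary, "oci-hook", "--", "nvidia-container-cli", "configure"},
	})
}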
daemon/gpu/nvidia/nvidia.go
Outdated
errdefs
daemon/gpu/nvidia/nvidia.go
Outdated
A context should never be passed as nil.
daemon/oci_linux.go
Outdated
🤔
This is what I was mentioning above: do you want to pass a boolean here instead of the daemon?
force-pushed b6f5ffe to 88eff77
Updated my PR to make the changes requested by @cpuguy83. I dropped the idea of setResourcesExperimental. The vendoring is still hacky since this is not the right place to do it, I believe, and my understanding is that we would want to wait for a new minor/patch version of containerd anyway. EDIT: messed up the commit split, fixing now.
force-pushed 88eff77 to 658675e
}

if len(r.GPUs) > 0 {
	if !experimental {
Would it make sense to push this experimental check up the stack so we don't have to deal with it here?
Also, do you really feel this should be experimental? Are you worried the API is not sufficient?
Would it make sense to push this experimental check up the stack so we don't have to deal with it here?
I initially considered what you said previously, that is, creating a function setResourcesExperimental and putting it here after setResources:
Lines 737 to 739 in 658675e:
if err := setResources(&s, c.HostConfig.Resources, daemon.HasExperimental()); err != nil {
	return nil, fmt.Errorf("linux runtime spec resources: %v", err)
}
Then I had two options:
if daemon.HasExperimental() {
	if err := setResourcesExperimental(...); err != nil {
		return nil, fmt.Errorf(...)
	}
}
That's not great since we lose the ability to warn/error if GPUs are requested but the experimental flag is not set.
The other option would have been:
if err := setResourcesExperimental(..., daemon.HasExperimental()); err != nil {
	return nil, fmt.Errorf(...)
}
But it's bizarre to have a function named setResourcesExperimental and then pass it a boolean flag so that it actually does something.
Let me know if you see another option.
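For concreteness, here is the second option with the elided pieces filled in illustratively; the GPUs field, the error constructor, and the message are assumptions for the sketch, not the PR's actual code.

import (
	"errors"

	containertypes "github.com/docker/docker/api/types/container"
	"github.com/docker/docker/errdefs"
	specs "github.com/opencontainers/runtime-spec/specs-go"
)

func setResourcesExperimental(s *specs.Spec, r containertypes.Resources, experimental bool) error {
	if len(r.GPUs) == 0 {
		return nil
	}
	if !experimental {
		// keeps the ability to reject GPU requests when --experimental is off
		return errdefs.InvalidParameter(errors.New("GPU support requires experimental mode"))
	}
	// ... GPU-specific spec edits would go here ...
	return nil
}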
Moving this conversation to the main thread since it's a more fundamental question.
But for all of the above, with my implementation being the sole client/consumer, I would understand gating that behind an experimental flag.
@thaJeztah how does vendoring work when there are client and server changes?
@crosbymichael you mean if changes are needed both in the CLI and on the daemon side?
I guess that's what he meant. My guess is that the ordering needs to be:
Is that correct?
Ping @thaJeztah @crosbymichael do you confirm we should gate this feature behind the experimental flag?
@thaJeztah @crosbymichael since I imagine we're now close to the
@flx42 Yes.
@flx42 this one needs to be rebased
Closed by #38828
API discussion: #37434
CLI discussion: docker/cli#1200
This is a simple prototype for GPU support in dockerd. It uses a simple in-tree plugin mechanism similar to the graphdriver plugins in daemon/graphdriver/..., but stateless.

Plugin API

Since plugins have access to the full OCI spec, I expect plugins to do one (or multiple) of the following:
Is this acceptable? Or do you want to restrict which part of the spec the plugins can edit?
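Purely as an illustration of the kind of stateless hook point described above, here is a sketch of what the registry side could look like; the Plugin type and the Register function are assumptions, not the API actually proposed in this PR.

package gpu

import (
	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// Plugin edits the OCI spec in place for the GPUs requested by a container.
// It is stateless: everything it needs is passed in on each call.
type Plugin func(s *specs.Spec, gpus []string) error

var plugins = map[string]Plugin{}

// Register makes a plugin available to the daemon under the given name.
func Register(name string, p Plugin) {
	plugins[name] = p
}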
Discussion
The demo implementation for the nvidia plugin in this PR relies on the existing nvidia-container-runtime-hook binary, since I'm not sure yet how you think we should leverage the recently added GPU support in containerd. This feels a bit awkward since we're not sure what other implementations could look like.
If we do this, we technically could get rid of OCI annotations since everything is handled in dockerd, but I think it's good to keep a trace of what happened in config.json, even if it's plain annotations.
containerd directly?