[WIP] api: add a GPU resource type by flx42 · Pull Request #37434 · moby/moby

Conversation

@flx42 (Contributor) commented Jul 10, 2018

CLI discussion: docker/cli#1200

First draft of what the API might look like. I don't believe we need something more complex than that.

Discussion points

  • Since in the CLI discussion we agreed to namespace GPUs per vendor, naturally the entry point for the API is a map to access all the GPUs/options for a single vendor. We could have a single slice of GPUs and a single option map, but it feels awkward if a vendor needs to skip devices/options that it doesn't own.
  • Naming is hard, so Resources (plural) is all the GPUs from all vendors. Resource is all the GPUs from a single vendor. Thoughts?
  • Let's not support per-GPU options for now, but we can add that to the GPU struct later without breaking existing code.
  • In the CLI discussion we discussed a shortcut to select all GPUs: --gpus nvidia. However, at this point it's not possible to get the actual list of GPU IDs. We have two options: a) if gpu.Resource.GPUs is empty, then select all GPUs, OR b) if gpu.Resource.GPUs[0].ID == "*" then select all GPUs (a sketch of both conventions follows this list).
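
For illustration, here is a minimal daemon-side sketch of how both conventions could be handled (selectGPUs and discoverGPUIDs are hypothetical helper names, not part of this PR):

package gpu

type Vendor string

type GPU struct {
        ID string
}

// Resource holds all the requested GPUs and options for a single vendor.
type Resource struct {
        GPUs    []GPU
        Options map[string]string
}

// discoverGPUIDs stands in for a vendor-specific enumeration call;
// the returned IDs are placeholder data.
func discoverGPUIDs(vendor Vendor) []string {
        return []string{"GPU-0", "GPU-1"}
}

// selectGPUs expands a request into concrete GPU IDs, honoring both
// proposed "all GPUs" conventions: a) an empty GPUs slice, or
// b) a single entry whose ID is "*".
func selectGPUs(vendor Vendor, r Resource) []string {
        if len(r.GPUs) == 0 || (len(r.GPUs) == 1 && r.GPUs[0].ID == "*") {
                return discoverGPUIDs(vendor)
        }
        ids := make([]string, 0, len(r.GPUs))
        for _, g := range r.GPUs {
                ids = append(ids, g.ID)
        }
        return ids
}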

Contributor:

Yes, it is pretty confusing ;)

@cpuguy83 (Member):

There's already some work (in swarmkit) around "GenericResources".
I wonder if that can, or should, be brought over into the runtime side for this.

https://godoc.org/github.com/docker/docker/api/types/swarm#GenericResource

It's a bit more abstract because it's designed to be pretty loose: resources are defined by admins for a given node, and the scheduler matches the resources defined on the node with the resources defined in the service... so it may not necessarily be the best fit.
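
For reference, the GenericResource type at that link looks roughly like this (paraphrased from the godoc; see the link above for the authoritative definition):

package swarm

// GenericResource represents a user-defined resource which can be either
// an integer (e.g. SSD=3) or a string (e.g. SSD=sda1).
type GenericResource struct {
        NamedResourceSpec    *NamedGenericResource    `json:",omitempty"`
        DiscreteResourceSpec *DiscreteGenericResource `json:",omitempty"`
}

// NamedGenericResource identifies a resource by a string value,
// e.g. Kind="GPU", Value="UUID-1".
type NamedGenericResource struct {
        Kind  string `json:",omitempty"`
        Value string `json:",omitempty"`
}

// DiscreteGenericResource counts a resource with an integer value,
// e.g. Kind="SSD", Value=3.
type DiscreteGenericResource struct {
        Kind  string `json:",omitempty"`
        Value int64  `json:",omitempty"`
}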

@flx42 (Contributor, Author) commented Jul 11, 2018

Yes, the GenericResource support was added in swarmkit by my colleague @RenaudWasTaken. As you said, this existing API seems to be a better fit for scheduling/matching purposes, but we'd lose the per-vendor separation that I think is useful for API users; on the other hand, this type might be reused for other purposes in the future too.

But those are not major problems; we could work with the existing GenericResource type if we also attach options to each resource.

@flx42 (Contributor, Author) commented Jul 11, 2018

So the fundamental question here is:

  • Do we want a GPU resource type? Or do we want a more generic API similar to Swarm?

I'm fine either way.

@cpuguy83 (Member):

I think a GPU type works fine and is much simpler to deal with.

@flx42 force-pushed the api-add-gpu-resources branch 3 times, most recently from f45a887 to ad193a6 on July 12, 2018, 19:00
@flx42 (Contributor, Author) commented Jul 12, 2018

Updated my PR; we now have the following:

package container

type Resources struct {
        GPUsByVendor map[gpu.Vendor]*gpu.Resources
        [...]
}
package gpu

type Vendor string

type Resources struct {
        GPUs    []GPU
        Options map[string]string
}

type GPU struct {
        ID string
}

I don't think we need more than that today. If we agree on the API, I can start to flesh out the next step: integration in dockerd by transforming these GPU resources into OCI annotations, then calling into the new containerd API.
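
For illustration, a hypothetical client filling in this draft might look like the following (the UUID and option values are made up):

res := container.Resources{
        GPUsByVendor: map[gpu.Vendor]*gpu.Resources{
                "nvidia": {
                        GPUs:    []gpu.GPU{{ID: "GPU-uuid-0"}},
                        Options: map[string]string{"capabilities": "compute"},
                },
        },
}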

@cpuguy83 (Member) left a comment:

I'm not really sure what makes sense from the GPU side of things, but I'd like to make sure the API doesn't limit how we can use this later on (or limit it to what the proposed docker CLI solution can handle).

Would it make sense to have the API be GPUs []gpu.Resource, where a gpu.Resource is:

package gpu

type Resource struct {
        Type    string // e.g. 'nvidia'
        ID      string // GPU UUID... whatever the vendor defines
        Options map[string]string
}

Also, this may not require its own package since there's not really much to it.

Anyway, as I said, I'm not sure what makes sense from the vendor perspective here, but this seems like it makes the API a bit more extensible, and the complexity can be handled by the client.

Member:

This limits the API quite a bit.
What do you think about using a slice here, and then we can validate as needed in code?

@flx42 (Contributor, Author) commented Jul 12, 2018

I have a concern with this approach:

type Resource struct {
        Type    string // e.g. 'nvidia'
        ID      string // GPU UUID... whatever the vendor defines
        Options map[string]string
}

Some options only make sense when applied to all GPUs from the same vendor. For instance, for us, the following would be nonsensical:

{ "nvidia", "0", {"capabilities": "compute"}}, { "nvidia", "1", {"capabilities": "video"}}

The capabilities option can't be different for each GPU. While we can enforce that from the CLI, we can't at the API level. Of course, it's always possible to reject such a construct when the vendor-specific code iterates through its GPUs.

My API above assumes that other vendors will be in the same situation, so I wanted to be able to enforce that at the API level. That said, if you agree to put this burden on the vendor implementation, then your approach is fine.
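
A minimal sketch of the vendor-side check this would imply under the flat []Resource proposal (validateNvidiaOptions is a hypothetical helper, not part of this PR; Resource is as defined above):

import "fmt"

// validateNvidiaOptions rejects requests where the "capabilities" option
// differs between two nvidia entries, since that option only makes sense
// at the container level.
func validateNvidiaOptions(resources []Resource) error {
        seen := false
        var caps string
        for _, r := range resources {
                if r.Type != "nvidia" {
                        continue // entries owned by other vendors are ignored
                }
                c := r.Options["capabilities"]
                if !seen {
                        caps, seen = c, true
                        continue
                }
                if c != caps {
                        return fmt.Errorf("nvidia: conflicting capabilities %q and %q", caps, c)
                }
        }
        return nil
}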

@cpuguy83 (Member):

@flx42 Thanks for the feedback! So when you say "all" of the GPUs, does that mean literally all GPUs on the system, or all the selected GPUs?

Do we know anyone from AMD that would be able to provide input here as well?

@flx42 (Contributor, Author) commented Jul 12, 2018

So when you say "all" of the GPUs, does that mean literally all GPUs on the system, or all the selected GPUs?

All the selected GPUs for the nvidia vendor. Or in other words, this option applies at the container level.

Maybe @kad can chime in on the Intel side for OpenCL; they could have a similar option for selecting/modifying the OpenCL ICD.

@cpuguy83 (Member):

@flx42 It seems like your structs work well (but I'd still prefer a slice over a map in hostconfig, and leave it up to the implementation to validate). Perhaps we can change Resource to GPUSet?

@flx42 (Contributor, Author) commented Jul 17, 2018

Just to confirm, to avoid re-triggering the CI with a push, are you OK with the following?

package container

type GPUSet struct {
        Vendor  string // or "Type"
        GPUs    []GPU
        Options map[string]string
}

type GPU struct {
        ID string
}

type Resources struct {
        GPUs []*GPUSet
        [...]
}

@cpuguy83 (Member) commented Jul 17, 2018

Is Resources needed here? I would think we'd have []GPUSet in HostConfig.

@flx42 (Contributor, Author) commented Jul 17, 2018

Well, the existing Resources struct is embedded in HostConfig:

// Contains container's resources (cgroups, ulimits)
Resources

I think it makes sense to consider a GPU as a resource, since that's where CPU, char/block devices, and memory live.
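
A sketch of why that works: HostConfig embeds Resources, so adding the field to Resources surfaces it directly on HostConfig through Go's field promotion (unrelated fields elided):

type Resources struct {
        GPUs []*GPUSet
        // ... existing CPU, memory, and device fields ...
}

type HostConfig struct {
        // ...
        Resources // embedded: hostConfig.GPUs resolves via field promotion
}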

@flx42 (Contributor, Author) commented Jul 17, 2018

The Resources struct in #37434 (comment) is the existing one, in case it was not clear, sorry :)

@cpuguy83 (Member):

Ah I see, yeah that's perfect.

Signed-off-by: Felix Abecassis <fabecassis@nvidia.com>
@flx42 force-pushed the api-add-gpu-resources branch from ad193a6 to b673450 on July 18, 2018, 00:44
@flx42 (Contributor, Author) commented Jul 18, 2018

Made the changes; also added Options to struct GPU, even though the CLI proposal probably won't support per-GPU options initially.

Does anyone disagree with the current approach?

@flx42 (Contributor, Author) commented Jul 18, 2018

Assuming we are OK with this design, I will go ahead with the corresponding implementation in the CLI (`--gpus`) and in dockerd (for calling into the containerd features).
We can merge once we have everything.

@codecov bot commented Dec 22, 2018

Codecov Report

Merging #37434 into master will increase coverage by 0.46%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #37434      +/-   ##
==========================================
+ Coverage   35.31%   35.78%   +0.46%     
==========================================
  Files         616      617       +1     
  Lines       48851    51403    +2552     
==========================================
+ Hits        17254    18396    +1142     
- Misses      29225    30422    +1197     
- Partials     2372     2585     +213

@olljanat (Contributor):

@yongtang looks that this one passes tests. PTAL

@yongtang (Member):

/cc @cpuguy83

@RenaudWasTaken (Contributor):

Hello!

I'll be taking over @flx42's PRs.

@cpuguy83 do you have any problems left with the current API? If not, can we go ahead and merge this?
Thanks!

@cpuguy83 (Member):

Mainly concerned about other GPU vendors chiming in, but no one has in all this time... moving to code review.
