[WIP] api: add a GPU resource type by flx42 · Pull Request #37434 · moby/moby

Conversation

@flx42 (Contributor) commented Jul 10, 2018

CLI discussion: docker/cli#1200

First draft of what the API might look like. I don't believe we need something more complex than that.

Discussion points

  • Since in the CLI discussion we agreed to namespace GPUs per vendor, naturally the entry point for the API is a map to access all the GPUs/options for a single vendor. We could have a single slice of GPUs and a single option map, but it feels awkward if a vendor needs to skip devices/options that it doesn't own.
  • Naming is hard, so Resources (plural) is all the GPUs from all vendors. Resource is all the GPUs from a single vendor. Thoughts?
  • Let's not support per-GPU options for now, but we can add that to the GPU struct later without breaking existing code.
  • In the CLI discussion we discussed a shortcut to select all GPUs: --gpus nvidia. However, at this point it's not possible to get the actual list of GPU IDs. We have two options: a) if gpu.Resource.GPUs is empty, then select all GPUs, OR b) if gpu.Resource.GPUs[0].ID == "*" then select all GPUs (a sketch of both conventions follows this list).
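
For illustration, here is a minimal daemon-side sketch of how both conventions could be handled (selectGPUs and discoverGPUIDs are hypothetical helper names, not part of this PR):

package gpu

type Vendor string

type GPU struct {
        ID string
}

// Resource holds all the requested GPUs and options for a single vendor.
type Resource struct {
        GPUs    []GPU
        Options map[string]string
}

// discoverGPUIDs stands in for a vendor-specific enumeration call;
// the returned IDs are placeholder data.
func discoverGPUIDs(vendor Vendor) []string {
        return []string{"GPU-0", "GPU-1"}
}

// selectGPUs expands a request into concrete GPU IDs, honoring both
// proposed "all GPUs" conventions: a) an empty GPUs slice, or
// b) a single entry whose ID is "*".
func selectGPUs(vendor Vendor, r Resource) []string {
        if len(r.GPUs) == 0 || (len(r.GPUs) == 1 && r.GPUs[0].ID == "*") {
                return discoverGPUIDs(vendor)
        }
        ids := make([]string, 0, len(r.GPUs))
        for _, g := range r.GPUs {
                ids = append(ids, g.ID)
        }
        return ids
}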

Contributor:

Yes, it is pretty confusing ;)

@cpuguy83 (Member):

There's already some work (in swarmkit) around "GenericResources".
I wonder if that can, or should, be brought over into the runtime side for this.

https://godoc.org/github.com/docker/docker/api/types/swarm#GenericResource

It's a bit more abstract because it's designed to be pretty loose: resources are defined by admins for a given node, and the scheduler matches the resources defined on the node with the resources defined in the service... so it may not necessarily be the best fit.
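
For reference, the GenericResource type at that link looks roughly like this (paraphrased from the godoc; see the link above for the authoritative definition):

package swarm

// GenericResource represents a user-defined resource which can be either
// an integer (e.g. SSD=3) or a string (e.g. SSD=sda1).
type GenericResource struct {
        NamedResourceSpec    *NamedGenericResource    `json:",omitempty"`
        DiscreteResourceSpec *DiscreteGenericResource `json:",omitempty"`
}

// NamedGenericResource identifies a resource by a string value,
// e.g. Kind="GPU", Value="UUID-1".
type NamedGenericResource struct {
        Kind  string `json:",omitempty"`
        Value string `json:",omitempty"`
}

// DiscreteGenericResource counts a resource with an integer value,
// e.g. Kind="SSD", Value=3.
type DiscreteGenericResource struct {
        Kind  string `json:",omitempty"`
        Value int64  `json:",omitempty"`
}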

@flx42 (Contributor, Author) commented Jul 11, 2018

Yes, the GenericResource support was added in swarmkit by my colleague @RenaudWasTaken. As you said, this existing API seems to be a better fit for scheduling/matching purposes, but we'd lose the per-vendor separation that I think is useful for API users; on the other hand, this type might be reused for other purposes in the future too.

But those are not major problems; we could work with the existing GenericResource type if we also attach options to each resource.

@flx42 (Contributor, Author) commented Jul 11, 2018

So the fundamental question here is:

  • Do we want a GPU resource type? Or do we want a more generic API similar to Swarm?

I'm fine either way.

@cpuguy83 (Member):

I think a GPU type works fine and is much simpler to deal with.

@flx42 force-pushed the api-add-gpu-resources branch 3 times, most recently from f45a887 to ad193a6 on July 12, 2018, 19:00
@flx42 (Contributor, Author) commented Jul 12, 2018

Updated my PR; we now have the following:

package container

type Resources struct {
        GPUsByVendor map[gpu.Vendor]*gpu.Resources
        [...]
}
package gpu

type Vendor string

type Resources struct {
        GPUs    []GPU
        Options map[string]string
}

type GPU struct {
        ID string
}

I don't think we need more than that today. If we agree on the API, I can start to flesh out the next step: integration in dockerd by transforming these GPU resources into OCI annotations, then calling into the new containerd API.
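
For illustration, a hypothetical client filling in this draft might look like the following (the UUID and option values are made up):

res := container.Resources{
        GPUsByVendor: map[gpu.Vendor]*gpu.Resources{
                "nvidia": {
                        GPUs:    []gpu.GPU{{ID: "GPU-uuid-0"}},
                        Options: map[string]string{"capabilities": "compute"},
                },
        },
}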

@cpuguy83 (Member) left a comment:

I'm not really sure what makes sense from the GPU side of things, but I'd like to make sure the API doesn't limit how we can use this later on (or limit it to what the proposed docker CLI solution can handle).

Would it make sense to have the API be GPUs []gpu.Resource, where a gpu.Resource is:

package gpu

type Resource struct {
        Type    string // e.g. 'nvidia'
        ID      string // GPU UUID... whatever the vendor defines
        Options map[string]string
}

Also, this may not require its own package since there's not really much to it.

Anyway, as I said, I'm not sure what makes sense from the vendor perspective here, but this seems like it makes the API a bit more extensible, and the complexity can be handled by the client.

Member:

This limits the API quite a bit.
What do you think about using a slice here, and then we can validate as needed in code?

@flx42 (Contributor, Author) commented Jul 12, 2018

I have a concern with this approach:

type Resource struct {
        Type    string // e.g. 'nvidia'
        ID      string // GPU UUID... whatever the vendor defines
        Options map[string]string
}

Some options only make sense when applied to all GPUs from the same vendor. For instance, for us, the following would be nonsensical:

{ "nvidia", "0", {"capabilities": "compute"}}, { "nvidia", "1", {"capabilities": "video"}}

The capabilities option can't be different for each GPU. While we can enforce that from the CLI, we can't at the API level. Of course, it's always possible to reject such a construct when the vendor-specific code iterates through its GPUs.

My API above assumes that other vendors will be in the same situation, so I wanted to be able to enforce that at the API level. That said, if you agree to put this burden on the vendor implementation, then your approach is fine.
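
A minimal sketch of the vendor-side check this would imply under the flat []Resource proposal (validateNvidiaOptions is a hypothetical helper, not part of this PR; Resource is as defined above):

import "fmt"

// validateNvidiaOptions rejects requests where the "capabilities" option
// differs between two nvidia entries, since that option only makes sense
// at the container level.
func validateNvidiaOptions(resources []Resource) error {
        seen := false
        var caps string
        for _, r := range resources {
                if r.Type != "nvidia" {
                        continue // entries owned by other vendors are ignored
                }
                c := r.Options["capabilities"]
                if !seen {
                        caps, seen = c, true
                        continue
                }
                if c != caps {
                        return fmt.Errorf("nvidia: conflicting capabilities %q and %q", caps, c)
                }
        }
        return nil
}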

@cpuguy83 (Member):

@flx42 Thanks for the feedback! So when you say "all" of the GPUs, does that mean literally all GPUs on the system, or all the selected GPUs?

Do we know anyone from AMD that would be able to provide input here as well?

@flx42 (Contributor, Author) commented Jul 12, 2018

So when you say "all" of the GPUs, does that mean literally all GPUs on the system, or all the selected GPUs?

All the selected GPUs for the nvidia vendor. Or in other words, this option applies at the container level.

Maybe @kad can chime in on the Intel side for OpenCL; they could have a similar option for selecting/modifying the OpenCL ICD.

@cpuguy83 (Member):

@flx42 It seems like your structs work well (but I'd still prefer a slice over a map in hostconfig, and leave it up to the implementation to validate). Perhaps we can change Resource to GPUSet?

@flx42 (Contributor, Author) commented Jul 17, 2018

Just to confirm, to avoid re-triggering the CI with a push, are you OK with the following?

package container

type GPUSet struct {
        Vendor  string // or "Type"
        GPUs    []GPU
        Options map[string]string
}

type GPU struct {
        ID string
}

type Resources struct {
        GPUs []*GPUSet
        [...]
}

@cpuguy83 (Member) commented Jul 17, 2018

Is Resources needed here? I would think we'd have []GPUSet in HostConfig.

@flx42 (Contributor, Author) commented Jul 17, 2018

Well, the existing Resources struct is embedded in HostConfig:

// Contains container's resources (cgroups, ulimits)
Resources

I think it makes sense to consider a GPU as a resource, since that's where CPU, char/block devices, and memory live.
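
A sketch of why that works: HostConfig embeds Resources, so adding the field to Resources surfaces it directly on HostConfig through Go's field promotion (unrelated fields elided):

type Resources struct {
        GPUs []*GPUSet
        // ... existing CPU, memory, and device fields ...
}

type HostConfig struct {
        // ...
        Resources // embedded: hostConfig.GPUs resolves via field promotion
}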

@flx42 (Contributor, Author) commented Jul 17, 2018

The Resources struct in #37434 (comment) is the existing one, in case it was not clear, sorry :)

@cpuguy83 (Member):

Ah I see, yeah that's perfect.

Signed-off-by: Felix Abecassis <fabecassis@nvidia.com>
@flx42 force-pushed the api-add-gpu-resources branch from ad193a6 to b673450 on July 18, 2018, 00:44
@flx42 (Contributor, Author) commented Jul 18, 2018

Made the changes; also added Options to struct GPU, even though the CLI proposal probably won't support per-GPU options initially.

Does anyone disagree with the current approach?

@flx42 (Contributor, Author) commented Jul 18, 2018

Assuming we are OK with this design, I will go ahead with the corresponding implementation in the CLI (`--gpus`) and in dockerd (for calling into the containerd features).
We can merge once we have everything.

@codecov bot commented Dec 22, 2018

Codecov Report

Merging #37434 into master will increase coverage by 0.46%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #37434      +/-   ##
==========================================
+ Coverage   35.31%   35.78%   +0.46%     
==========================================
  Files         616      617       +1     
  Lines       48851    51403    +2552     
==========================================
+ Hits        17254    18396    +1142     
- Misses      29225    30422    +1197     
- Partials     2372     2585     +213

@olljanat (Contributor):

@yongtang looks that this one passes tests. PTAL

@yongtang (Member):

/cc @cpuguy83

@RenaudWasTaken (Contributor):

Hello!

I'll be taking over @flx42's PRs.

@cpuguy83 do you have any problems left with the current API? If not, can we go ahead and merge this?
Thanks!

@cpuguy83 (Member):

Mainly concerned about other GPU vendors chiming in, but no one has in all this time... moving to code review.
