feat(runtimes): Add Framework Label to the Runtimes by andreyvelich · Pull Request #2761 · kubeflow/trainer · GitHub

andreyvelich
Member
@andreyvelich andreyvelich commented Jul 30, 2025

As we discussed in Slack and GitHub, we would like to introduce this label on the runtimes to define the ML framework:

trainer.kubeflow.org/framework

Ref: kubeflow/sdk#31 (comment),
https://cloud-native.slack.com/archives/C0742LDFZ4K/p1753710956860929
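
A minimal sketch of how a client might read this label from a runtime manifest. The helper name `get_framework`, the example runtime name, and the label value `torch` are illustrative assumptions and are not taken from this PR; only the label key comes from the description above.

```python
from typing import Optional

# Label key introduced by this PR.
FRAMEWORK_LABEL = "trainer.kubeflow.org/framework"


def get_framework(runtime: dict) -> Optional[str]:
    """Return the framework label of a runtime manifest, or None if it is absent."""
    metadata = runtime.get("metadata") or {}
    labels = metadata.get("labels") or {}
    return labels.get(FRAMEWORK_LABEL)


# Hypothetical manifest, already parsed into a dict.
# The name and the label value are assumptions made for illustration.
runtime = {
    "apiVersion": "trainer.kubeflow.org/v1alpha1",
    "kind": "ClusterTrainingRuntime",
    "metadata": {
        "name": "torch-distributed",
        "labels": {FRAMEWORK_LABEL: "torch"},
    },
}

framework = get_framework(runtime)
```

A runtime without the label yields `None`, so callers can decide how to treat unlabeled runtimes.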

/assign @kubeflow/kubeflow-trainer-team @astefanutti @kramaranya

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@coveralls
coveralls commented Jul 30, 2025

Pull Request Test Coverage Report for Build 16645963859

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 0.0%

Totals Coverage Status
Change from base Build 16592909581: 0.0%
Covered Lines: 0
Relevant Lines: 0

💛 - Coveralls

Contributor
@kramaranya kramaranya left a comment

Thank you!!
/lgtm

Out of curiosity, what are the plans with this mpi runtime?

# TODO (andreyvelich): Change this to DeepSpeed or MLX runtime.

metadata:
  name: deepspeed-distributed
  labels:
    trainer.kubeflow.org/trainer-type: custom
Contributor

Just to be sure, we don't think the trainer type can be safely inferred in the SDK from the framework label?

Member Author

We can if we define the mapping of supported builtin trainers in the SDK.
Shall we try to do that initially @astefanutti ?

Contributor

Yes, I'd be inclined to try that so we keep what has to be exposed on the training runtimes minimal.
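
The mapping idea discussed above could look roughly like the following sketch. Everything here is hypothetical: the `TrainerType` enum, the `BUILTIN_TRAINER_FRAMEWORKS` set, the `infer_trainer_type` helper, and the framework value `torchtune` are assumptions for illustration, not the SDK's actual API.

```python
from enum import Enum
from typing import Optional


class TrainerType(Enum):
    """Hypothetical trainer kinds an SDK might distinguish."""
    CUSTOM = "custom"
    BUILTIN = "builtin"


# Hypothetical set of framework label values that have a builtin trainer.
# Any framework not listed here falls back to a custom trainer.
BUILTIN_TRAINER_FRAMEWORKS = {"torchtune"}


def infer_trainer_type(framework: Optional[str]) -> TrainerType:
    """Infer the trainer type from the value of the framework label."""
    if framework in BUILTIN_TRAINER_FRAMEWORKS:
        return TrainerType.BUILTIN
    return TrainerType.CUSTOM
```

With such a mapping kept in the SDK, the runtimes would only need to expose the framework label, and a missing or unknown framework would default to a custom trainer.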

Member
@Electronic-Waste Electronic-Waste Jul 31, 2025

So, for yaml users, they'll use the runtime without trainer.kubeflow.org/trainer-type label? Is this label only intended for the validation in SDK?

Member Author

So, for yaml users,

@Electronic-Waste I don't think that this is needed for YAML users.
If users are familiar with kubectl, they can always check the TrainJob and TrainingRuntimeSpec by themselves.
Also, it is very tricky to use TorchTune runtimes without the SDK, since users don't know which parameters they can specify (e.g. TorchTuneConfig).

@andreyvelich
Member Author

Thank you!! /lgtm

Out of curiosity, what are the plans with this mpi runtime?

# TODO (andreyvelich): Change this to DeepSpeed or MLX runtime.

We have a WIP PR to remove it: #2760. We are still discussing how to define a deprecation strategy for the runtimes.

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@google-oss-prow google-oss-prow bot removed the lgtm label Jul 31, 2025
@andreyvelich andreyvelich changed the title feat(runtimes): Add Trainer Type and Framework Labels feat(runtimes): Add Framework Label to the Runtimes Jul 31, 2025
@astefanutti
Contributor

/lgtm

Member
@tenzen-y tenzen-y left a comment

Thx
/lgtm
/approve


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

@google-oss-prow google-oss-prow bot merged commit aa46e02 into kubeflow:master Jul 31, 2025
19 checks passed
@google-oss-prow google-oss-prow bot added this to the v2.1 milestone Jul 31, 2025
@andreyvelich andreyvelich deleted the add-runtime-labels branch July 31, 2025 11:30
astefanutti pushed a commit to astefanutti/training-operator that referenced this pull request Sep 24, 2025
* feat(runtimes): Add Trainer Type and Framework Labels

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Remove trainer type from the labels

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
andreyvelich added a commit that referenced this pull request Sep 24, 2025
* feat(runtimes): Add Trainer Type and Framework Labels



* Remove trainer type from the labels



---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
6 participants