feat(trainer): Refactor get_job_logs() API with Iterator by andreyvelich · Pull Request #83 · kubeflow/sdk · GitHub

Conversation

andreyvelich
Member

Fixes: #75

I updated get_job_logs() to always return an iterator over the TrainJob's step logs.

/assign @kubeflow/kubeflow-sdk-team

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@coveralls
coveralls commented Sep 3, 2025

Pull Request Test Coverage Report for Build 17436449337

Details

  • 11 of 15 (73.33%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+5.1%) to 70.146%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
kubeflow/trainer/backends/kubernetes/backend.py | 11 | 15 | 73.33%

Totals (Coverage Status)
Change from base Build 17382047367: 5.1%
Covered Lines: 289
Relevant Lines: 412

💛 - Coveralls

Contributor
@kramaranya kramaranya left a comment

Thank you @andreyvelich!
Great to see that the code looks much cleaner with the Iterator approach 🎉


step: str = constants.NODE + "-0",
) -> Iterator[str]:
"""Get the TrainJob logs"""
Contributor

Do we need this docstring here?

Member Author

Not necessary, but it is a reminder for developers and AI tools of how this API is used 🙂

Contributor

I see, sounds good to me!


  self.wait_for_job_status(job_name)
- print(self.get_job_logs(job_name)["node-0"])
+ print(self.get_job_logs(job_name))
Contributor

This will print a generator object, not logs, right?
I think we should instead do:

for line in self.get_job_logs(job_name):
    print(line, end="")
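
To illustrate the point above, here is a minimal sketch of the difference between printing the iterator and iterating over it. `fake_job_logs` is a hypothetical stand-in for `get_job_logs()`, not the SDK implementation, and the trailing newlines on each line are an assumption.

```python
from typing import Iterator


def fake_job_logs() -> Iterator[str]:
    """Hypothetical stand-in for get_job_logs(): yields log lines lazily."""
    # Assumption: each line already ends with a newline.
    yield "Installing packages...\n"
    yield "Training started\n"


# Printing the iterator itself shows the object's repr, not the log text.
as_repr = repr(fake_job_logs())
print(as_repr)

# Iterating consumes the lines one at a time, so output can be streamed.
streamed = []
for line in fake_job_logs():
    print(line, end="")
    streamed.append(line)
```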

Contributor

We also need to update other references, for example the README.

Contributor

Would print(*self.get_job_logs(job_name), sep="\n") work?

Otherwise, would there be a way to override the string representation of the returned iterator?

Contributor

Hmm it could, but I think we might run out of memory :) And there would be no streaming, because we would have to wait for the iterator to finish, which works against the purpose of an iterator.

Why not just use this?

for line in self.get_job_logs(job_name):
    print(line, end="")

Contributor

I agree, and I prefer the streaming approach. It was really just to get a one-liner :)

Member Author

Hmm it could, but I think we might run out of memory :)

Why would we run out of memory? Since we just print the pip list + nvidia-smi output, the log would be small.

Users can do something like this if they don't want to define a loop:

print("\n".join(TrainerClient().get_job_logs(name=job_id)))

Contributor

Why would we run out of memory? Since we just print the pip list + nvidia-smi output, the log would be small.

Yeah, I meant when logs are large
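
The memory concern can be sketched as follows. This is only an illustration with a hypothetical `fake_job_logs` generator that counts how many lines it has produced; it does not reflect how the SDK reads logs.

```python
from itertools import islice
from typing import Iterator

stats = {"produced": 0}


def fake_job_logs(total: int) -> Iterator[str]:
    """Hypothetical log source that counts how many lines it has produced."""
    for i in range(total):
        stats["produced"] += 1
        yield f"step {i}"


# Joining materializes every line in memory before anything is printed.
stats["produced"] = 0
text = "\n".join(fake_job_logs(1000))
joined_count = stats["produced"]

# Streaming holds one line at a time; islice stops after the first three.
stats["produced"] = 0
head = list(islice(fake_job_logs(1000), 3))
streamed_count = stats["produced"]
```

For small logs the join is harmless; the difference only matters once the log no longer fits comfortably in memory or the caller wants output before the job finishes.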

Contributor

Alternatively we could have two APIs -- one for streaming and another for returning a complete string, similar to what Ray does

@andreyvelich @astefanutti any thoughts on that?

Member Author

I would suggest that we consolidate this into the single get_job_logs() API, similar to how we consolidated BuiltinTrainer and CustomTrainer into the train() API.
I don't see much value in separating them, since it is better to return Iterator[str] for both.

Contributor

SGTM
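
One way to read the decision above: with a single iterator-returning API, a "complete string" variant is a thin caller-side wrapper rather than a second API. A minimal sketch, where `get_logs` and `logs_as_text` are hypothetical names and not part of the SDK:

```python
from typing import Iterable, Iterator


def get_logs(source: Iterable[str]) -> Iterator[str]:
    """Hypothetical single entry point: always yields lines lazily."""
    for line in source:
        yield line


def logs_as_text(source: Iterable[str]) -> str:
    """Caller-side convenience for anyone who wants one complete string."""
    return "\n".join(get_logs(source))


lines = ["epoch 1", "epoch 2"]
full_text = logs_as_text(lines)
first_line = next(get_logs(lines))
```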

if all(finished):
# Stream logs incrementally
for logline in log_stream:
if logline is None:
Contributor

Can this actually yield None?

Member Author

I think we can just do this:

  if logline is None:
      return

Contributor

Hmm, IIUC it never yields None https://github.com/kubernetes-client/python/blob/master/kubernetes/base/watch/watch.py#L213-L216

We could just do

for logline in log_stream:
    yield logline
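
For plain iteration, the loop above behaves the same as delegating with `yield from`. The sketch below compares the two on a sample list; the function names are hypothetical and chosen only for this illustration.

```python
from typing import Iterable, Iterator


def relay_with_loop(log_stream: Iterable[str]) -> Iterator[str]:
    # Explicit loop, as in the suggestion above.
    for logline in log_stream:
        yield logline


def relay_with_yield_from(log_stream: Iterable[str]) -> Iterator[str]:
    # Delegation: yields every item of log_stream in order.
    yield from log_stream


sample = ["line 1", "line 2"]
loop_result = list(relay_with_loop(sample))
delegated_result = list(relay_with_yield_from(sample))
```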

  name=pod_name,
  namespace=self.namespace,
- container=constants.NODE,
+ container=re.sub(r"-\d+$", "", step),  # Remove the number for the node step.
Contributor

nit: we call this twice; we could just compute it once and reuse it
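
A small sketch of what the substitution does and how it could be computed once. `container_for_step` is a hypothetical helper, and `dataset-initializer` is used here only as an assumed example of a step name without a numeric suffix.

```python
import re

# Matches a trailing "-<digits>" suffix, e.g. the "-0" in "node-0".
NODE_INDEX_SUFFIX = re.compile(r"-\d+$")


def container_for_step(step: str) -> str:
    """Hypothetical helper: derive the container name from a step name."""
    return NODE_INDEX_SUFFIX.sub("", step)


# Compute the container name once and reuse the variable afterwards.
container = container_for_step("node-0")
multi_digit = container_for_step("node-12")
unchanged = container_for_step("dataset-initializer")
```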

log_streams = []
log_streams.append(
watch.Watch().stream(
return iter([])
Contributor

Shall we raise RuntimeError or log a warning in this case?

Member Author

I'm not sure we should raise an exception here: at this stage the TrainJob has not yet produced logs, so we just return empty logs.
@astefanutti thoughts?

Contributor

I see, makes sense to me to return an empty iterator then
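
From the caller's side, an empty iterator needs no special handling. A minimal sketch, assuming a hypothetical `read_logs` helper that receives `None` when no pod exists yet:

```python
from typing import Iterator, Optional


def read_logs(pod_name: Optional[str]) -> Iterator[str]:
    """Hypothetical helper: no pod yet means no logs, not an error."""
    if pod_name is None:
        return iter([])
    return iter([f"{pod_name}: started", f"{pod_name}: done"])


# A for loop over an empty iterator simply runs zero times.
empty = list(read_logs(None))
# next() with a default lets a caller detect "no logs yet" without try/except.
first = next(read_logs(None), None)
lines = list(read_logs("node-0"))
```

Note that inside a generator function (one whose body contains `yield`), a bare `return` gives callers the same empty result.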

"""Test TrainerClient.get_job_logs with basic success path."""
print("Executing test:", test_case.name)
try:
trainer_client.namespace = test_case.config.get("namespace", DEFAULT_NAMESPACE)
Contributor

This should be from backend, right?

Suggested change
- trainer_client.namespace = test_case.config.get("namespace", DEFAULT_NAMESPACE)
+ trainer_client.backend.namespace = test_case.config.get("namespace", DEFAULT_NAMESPACE)

Member Author

Actually, the trainer client here is the Kubernetes backend, not TrainerClient():

yield KubernetesBackend(KubernetesBackendConfig())

Let me rename it.

Contributor

Ahh I see, thank you!


for index, log_queue in enumerate(log_queue_pool):
if all(finished):
# Stream logs incrementally
for logline in log_stream:
Contributor

Are we sure each item is an entire line?

Member Author

Yes, log_stream is a <generator object Watch.stream at 0x1073c6340> object, and we can return it line by line.

Contributor
@astefanutti astefanutti Sep 3, 2025

I've checked and it indeed does yield items line by line: https://github.com/kubernetes-client/python/blob/6e7c539f52dec4e993d2c32a4408920d8522f47e/kubernetes/base/watch/watch.py#L54-L83

I wasn't sure whether we had to do it ourselves or not.
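
For context, this is roughly what doing it ourselves would involve: a general line-buffering sketch that re-assembles arbitrary byte chunks into whole lines. It is an illustration of the technique only, not the Kubernetes client's code, and `iter_lines` is a hypothetical name.

```python
from typing import Iterable, Iterator


def iter_lines(chunks: Iterable[bytes]) -> Iterator[str]:
    """Re-assemble arbitrary byte chunks into complete text lines."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Emit every complete line currently in the buffer.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            yield line.decode("utf-8")
    # Flush a final line that had no trailing newline.
    if buffer:
        yield buffer.decode("utf-8")


# "b" and "c" arrive in different chunks but belong to the same line.
result = list(iter_lines([b"a\nb", b"c\n", b"d"]))
```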

andreyvelich and others added 3 commits September 3, 2025 12:02
Co-authored-by: Anya Kramar <akramar@redhat.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@andreyvelich
Member Author

@astefanutti @kramaranya I will open PRs to update the Trainer examples + website today.
Since this is a breaking change to the existing examples and the E2Es are failing, are we OK to merge this PR manually?

@kramaranya
Contributor

@astefanutti @kramaranya I will open PRs to update the Trainer examples + website today. Since this is a breaking change to the existing examples and the E2Es are failing, are we OK to merge this PR manually?

Sure, makes sense

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@kramaranya
Contributor

Thank you!
/lgtm

Member Author
@andreyvelich andreyvelich left a comment

/approve

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

@andreyvelich
Member Author

/retest

@andreyvelich
Member Author

Looks like the E2Es succeed after the change.
We don't need to manually merge this PR 🎉

@google-oss-prow google-oss-prow bot merged commit 0f7a988 into kubeflow:main Sep 3, 2025
10 of 14 checks passed
@google-oss-prow google-oss-prow bot added this to the v0.1 milestone Sep 3, 2025

  # Print TrainJob logs
- print(TrainerClient().get_job_logs(name=job_id, node_rank=0)["node-0"])
+ print("\n".join(TrainerClient().get_job_logs(name=job_id)))
Contributor

Is it worth showing follow=True here?

Member Author

By default, wait_for_job_status() waits until the TrainJob is complete, so showing an example with follow is unnecessary here.

Contributor

Oh yeah, missed this part

@andreyvelich andreyvelich deleted the get-job-logs-refactor branch September 3, 2025 15:14
jaiakash added a commit to jaiakash/trainer that referenced this pull request Sep 3, 2025
kubeflow/sdk#83

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
google-oss-prow bot pushed a commit to kubeflow/trainer that referenced this pull request Sep 4, 2025
* feat: support for creating and managing gpu cluster
* fix: makefile bug
* add: ci action to ask maintainers to add label to when changes are detected
* chore: fixed issues and cleanup
* fix: run check on change in pr
* feat: added separate workflow for gpu runner
* fix: deepspeed typo
* hotfix: add gpu label on PR without merging
* chore: merged into single action
* fix: run runner as soon as label is added
* fix: use gpu runner when label exist
* fix: revert changes and fix script permission
* fix: create gpu supported gpu
* fix: nvidia issue
* fix: gpu cluster and torchtune model
* fix: notebookpath and delete cluster
* tmp fix: notebook to use k8s client
* fix: use akash sdk and fix notebook size
* fix: notebook error
* fix: delete cluster before creating one and notebook
* fix: kube config
* fix: makefile add comment
* fix: nvidia runtime
* hotfix: disable e2e go
* fix: delete cluster
* fix: delete cluster
* hotfix: temporarily use my personal token
* chore: refactored code
* hotfix: take hf token from env of self runner vm
* fix: to run notebook directly
* refactor: torchtune job
* fix: ci action
* fix: pre commit hook
* chore: rename ci action
* rem: delete cluster command from makefile
* chore: rem some steps, fixed wait timing and notebook logs according to kubeflow/sdk#83
* update: upgrade k8s to 1.34.0

---------

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
tdn21 pushed a commit to tdn21/trainer that referenced this pull request Sep 6, 2025
Signed-off-by: Tarun Duhan <itarunduhan@gmail.com>
mahdikhashan pushed a commit to mahdikhashan/trainer that referenced this pull request Oct 4, 2025
Signed-off-by: Mahdi Khashan <mahdikhashan1@gmail.com>
Development

Successfully merging this pull request may close these issues.

Refactor get_job_logs() API for Kubeflow Trainer
4 participants