feat(trainer): Refactor get_job_logs() API with Iterator #83
Conversation
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Pull Request Test Coverage Report for Build 17436449337 💛 - Coveralls
Thank you @andreyvelich!
Great to see that the code looks much cleaner with the Iterator approach 🎉
    step: str = constants.NODE + "-0",
) -> Iterator[str]:
    """Get the TrainJob logs"""
Do we need this docstring here?
Not necessary, but it is just a reminder of how this API is used, for developers and AI tools 🙂
I see, sounds good to me!
self.wait_for_job_status(job_name)
print(self.get_job_logs(job_name)["node-0"])
print(self.get_job_logs(job_name))
This will print a generator object, not logs, right?
I think we should instead do:
for line in self.get_job_logs(job_name):
    print(line, end="")
We also need to update other references, for example the README.
Would print(*self.get_job_logs(job_name), sep="\n") work?
Otherwise, would there be a way to override the string representation of the returned iterator?
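(For illustration only, not part of this PR: a minimal sketch of what overriding the string representation could look like, using a hypothetical LogIterator wrapper. Note that __str__ would have to consume the whole iterator, which gives up streaming.)

from typing import Iterator

class LogIterator:
    """Hypothetical wrapper around the log generator (illustration only)."""

    def __init__(self, lines: Iterator[str]):
        self._lines = lines

    def __iter__(self) -> Iterator[str]:
        # Iterating keeps the lazy, streaming behaviour.
        return self._lines

    def __str__(self) -> str:
        # print(LogIterator(...)) would join everything, buffering all lines in memory.
        return "\n".join(self._lines)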
Hmm, it could, but I think we might run out of memory :) And there would be no streaming, because we have to wait for the iterator to finish, which defeats the purpose of the iterator.
Why not just use this?
for line in self.get_job_logs(job_name):
    print(line, end="")
I agree and prefer the streaming approach; it was really just to get a one-liner :)
> Hmm, it could, but I think we might run out of memory :)

Why would we run out of memory? Since we just print the pip list + nvidia-smi output, the log would be small.
Users can do something like this if they don't want to define a loop:
print("\n".join(TrainerClient().get_job_logs(name=job_id)))
> Why would we run out of memory? Since we just print the pip list + nvidia-smi output, the log would be small.

Yeah, I meant when the logs are large.
Alternatively we could have two APIs -- one for streaming and another for returning a complete string, similar to what Ray does
@andreyvelich @astefanutti any thoughts on that?
I would suggest that we consolidate it in the single get_job_logs() API, similar to how we consolidated BuiltinTrainer and CustomTrainer into the train() API.
I don't see much value in separating them, since it is better to return Iterator[str] for both.
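(Roughly, the consolidated API being discussed would look like the sketch below. The follow parameter is assumed from the later discussion in this thread, and the step default mirrors constants.NODE + "-0" from the diff above; the real class has many more members.)

from typing import Iterator

class TrainerClient:  # sketch only
    def get_job_logs(
        self,
        name: str,
        follow: bool = False,
        step: str = "node-0",  # constants.NODE + "-0" in the actual diff
    ) -> Iterator[str]:
        """Get the TrainJob logs."""
        ...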
SGTM
if all(finished):
    # Stream logs incrementally
    for logline in log_stream:
        if logline is None:
Can this actually yield None?
I think we can just do this:
if logline is None:
    return
Hmm, IIUC it never yields None: https://github.com/kubernetes-client/python/blob/master/kubernetes/base/watch/watch.py#L213-L216
We could just do:
for logline in log_stream:
    yield logline
name=pod_name,
namespace=self.namespace,
container=constants.NODE,
container=re.sub(r"-\d+$", "", step),  # Remove the number for the node step.
nit: we call this twice, we could just compute it once and reuse it.
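(Just to illustrate the nit, assuming the same re.sub appears at two call sites:)

import re

step = "node-0"  # example value; comes from the caller in the real code
container = re.sub(r"-\d+$", "", step)  # compute once: "node-0" -> "node"
# ...then pass `container` to both call sites instead of repeating the re.sub.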
log_streams = []
log_streams.append(
    watch.Watch().stream(
return iter([])
Shall we raise RuntimeError or log a warning in this case?
Not sure if we should raise an Exception here, since at this stage the TrainJob has not yet produced logs, so we just return empty logs.
@astefanutti thoughts?
I see, makes sense to me to return an empty iterator then
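(A rough sketch of the agreed behaviour, using a hypothetical helper and logger: return an empty iterator, optionally with a warning, instead of raising.)

import logging
from typing import Iterator

logger = logging.getLogger(__name__)

def job_logs_or_empty(pods: list) -> Iterator[str]:
    # Hypothetical helper: if the TrainJob has no pods yet, there are simply no logs.
    if not pods:
        logger.warning("TrainJob has not produced any logs yet")
        return iter([])
    ...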
"""Test TrainerClient.get_job_logs with basic success path.""" | ||
print("Executing test:", test_case.name) | ||
try: | ||
trainer_client.namespace = test_case.config.get("namespace", DEFAULT_NAMESPACE) |
This should be from backend, right?
- trainer_client.namespace = test_case.config.get("namespace", DEFAULT_NAMESPACE)
+ trainer_client.backend.namespace = test_case.config.get("namespace", DEFAULT_NAMESPACE)
Actually, trainer_client here is the Kubernetes backend, not TrainerClient():
yield KubernetesBackend(KubernetesBackendConfig())
Let me rename it.
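(For context, the fixture in question would look roughly like this; the fixture name kubernetes_backend is the suggested rename, and KubernetesBackend / KubernetesBackendConfig come from the snippet above.)

import pytest

@pytest.fixture
def kubernetes_backend():
    # KubernetesBackend / KubernetesBackendConfig are imported from the SDK module
    # under test (imports omitted here); this yields the backend, not TrainerClient().
    yield KubernetesBackend(KubernetesBackendConfig())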
Ahh I see, thank you!
for index, log_queue in enumerate(log_queue_pool):
if all(finished):
    # Stream logs incrementally
    for logline in log_stream:
Are we sure each item is an entire line?
Yes, log_stream is a <generator object Watch.stream at 0x1073c6340> object, and we can return it line by line.
I've checked and it indeed does yield items line by line: https://github.com/kubernetes-client/python/blob/6e7c539f52dec4e993d2c32a4408920d8522f47e/kubernetes/base/watch/watch.py#L54-L83
I wasn't sure whether we had to do it ourselves or not.
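(Sketch of the pattern for reference, using the kubernetes Python client; pod_name, namespace and container are assumed to be supplied by the caller.)

from kubernetes import client, watch

def stream_pod_logs(pod_name: str, namespace: str, container: str):
    core_api = client.CoreV1Api()
    # Watch.stream() wraps read_namespaced_pod_log and yields the pod log line by
    # line, so each item can be forwarded as-is.
    log_stream = watch.Watch().stream(
        core_api.read_namespaced_pod_log,
        name=pod_name,
        namespace=namespace,
        container=container,
        follow=True,
    )
    for logline in log_stream:
        yield logline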
@astefanutti @kramaranya I will open PRs to update Trainer examples + website today.
Sure, makes sense.
Thank you!
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull request has been approved by: andreyvelich
The full list of commands accepted by this bot can be found here. The pull request process is described here.
/retest
Looks like the E2Es succeed after the change.
# Print TrainJob logs
print(TrainerClient().get_job_logs(name=job_id, node_rank=0)["node-0"])
print("\n".join(TrainerClient().get_job_logs(name=job_id)))
Is it worth showing follow=True here?
By default, wait_for_job_status() waits until the TrainJob is complete, so showing an example with follow is unnecessary here.
Oh yeah, missed this part
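(If it were shown, a hypothetical follow example would only make sense while the TrainJob is still running, e.g.:)

# Hypothetical: stream logs live instead of waiting for completion first.
for line in TrainerClient().get_job_logs(name=job_id, follow=True):
    print(line, end="")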
Fixes: #75
I updated get_job_logs() to always return an iterator over the TrainJob's step logs.
/assign @kubeflow/kubeflow-sdk-team
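A minimal end-to-end usage sketch of the refactored API, assuming the usual from kubeflow.trainer import TrainerClient import path and an example TrainJob name:

from kubeflow.trainer import TrainerClient  # import path as in the SDK examples

job_name = "my-trainjob"  # example TrainJob name
client = TrainerClient()
client.wait_for_job_status(job_name)

# get_job_logs() now returns Iterator[str]: iterate to stream line by line...
for line in client.get_job_logs(name=job_name):
    print(line, end="")

# ...or join everything into a single string when the logs are small.
print("\n".join(client.get_job_logs(name=job_name)))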