Releases: kubeflow/trainer

v2.0.1

29 Sep 14:24

This is the Kubeflow Trainer v2.0.1 release.

New Features

  • [release-2.0] feat: Add a public function to create runtime info objects (#2846 by @kaisoz)

Bug Fixes

v2.0.0

21 Jul 15:59

This is the major release of the Kubeflow Trainer 2.0 project.

For more information, please see the

Quickstart

Install the Kubeflow Trainer control plane:

kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=v2.0.0"

Check that the JobSet and Trainer controller manager pods are running:

kubectl get pods -n kubeflow-system

NAME                                                  READY   STATUS    RESTARTS   AGE
jobset-controller-manager-54968bd57b-88dk4            2/2     Running   0          65s
kubeflow-trainer-controller-manager-cc6468559-dblnw   1/1     Running   0          65s
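
The readiness check above can also be scripted. The following is a minimal sketch, assuming the plain-text table format shown in the sample output; the helper names `parse_pods` and `is_ready` are hypothetical and not part of any Kubeflow tooling.

```python
from typing import Dict, List


def parse_pods(table: str) -> List[Dict[str, str]]:
    """Parse the whitespace-separated table printed by `kubectl get pods`."""
    lines = [ln for ln in table.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    # Only NAME, READY and STATUS are relied on below; later columns such as
    # RESTARTS may contain spaces in real output and could be misaligned here.
    return [dict(zip(header, line.split())) for line in lines[1:]]


def is_ready(pod: Dict[str, str]) -> bool:
    """A pod counts as ready when it is Running and all containers are ready."""
    ready, total = pod["READY"].split("/")
    return pod["STATUS"] == "Running" and ready == total


SAMPLE = """\
NAME                                                  READY   STATUS    RESTARTS   AGE
jobset-controller-manager-54968bd57b-88dk4            2/2     Running   0          65s
kubeflow-trainer-controller-manager-cc6468559-dblnw   1/1     Running   0          65s
"""

pods = parse_pods(SAMPLE)
all_ready = all(is_ready(p) for p in pods)
print("all controller pods ready:", all_ready)
```

In practice the table would come from running the kubectl command rather than from a hard-coded string.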

Install the Kubeflow Trainer runtimes:

kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/runtimes?ref=v2.0.0"
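
The two `kubectl apply` commands in this quickstart differ only in the overlay name and share the same `ref` tag. As an illustration, the sketch below builds both commands from a single version string so the control plane and the runtimes stay pinned to the same release. The helper names `overlay_url` and `apply_command` are hypothetical.

```python
from typing import List

REPO = "https://github.com/kubeflow/trainer.git"


def overlay_url(overlay: str, ref: str) -> str:
    """Build the kustomize remote URL for one overlay at a given Git ref."""
    return f"{REPO}/manifests/overlays/{overlay}?ref={ref}"


def apply_command(overlay: str, ref: str) -> List[str]:
    """Return the kubectl argument list used in the quickstart."""
    return ["kubectl", "apply", "--server-side", "-k", overlay_url(overlay, ref)]


# Install order from the quickstart: control plane first, then runtimes.
commands = [apply_command(name, "v2.0.0") for name in ("manager", "runtimes")]
for cmd in commands:
    print(" ".join(cmd))
```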

Install the Kubeflow Python SDK:

pip install git+https://github.com/kubeflow/sdk.git@main#subdirectory=python

Run your first TrainJob by following the getting started guide.

Breaking Changes

New Features

LLM Trainer V2

Runtime Framework

  • feat(runtimes): Support MLX Distributed Runtime with OpenMPI (#2565 by @andreyvelich)
  • feat(runtimes): Support DeepSpeed Runtime with OpenMPI (#2559 by @andreyvelich)
  • feat(runtime): remove needless Launcher chainer. (#2558 by @IRONICBo)
  • Store the TrainingRuntime numNodes as runtime.Info.PodSet.Count (#2539 by @tenzen-y)
  • Add dependencies to RuntimeRegistrar (#2476 by @tenzen-y)
  • KEP: 2170: Adding cel validations on TrainingRuntime/ClusterTrainingRuntime CRDs (#2313 by @akshaychitneni)
  • Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to ClusterTrainingRuntime (#2625 by @tenzen-y)
  • Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to TrainingRuntime (#2608 by @tenzen-y)
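
The last two items add a `trainer.kubeflow.org/resource-in-use` finalizer to runtimes. The sketch below models how a finalizer of this kind could protect a runtime from deletion. It relies on the general Kubernetes rule that an object marked for deletion is only removed once its finalizer list is empty. The rule for when the finalizer is added or dropped (while at least one TrainJob references the runtime) is an assumption made for illustration, not a description of the controller's actual code, and the `Runtime` class, the job name and the runtime name are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set

FINALIZER = "trainer.kubeflow.org/resource-in-use"


@dataclass
class Runtime:
    name: str
    finalizers: Set[str] = field(default_factory=set)
    deletion_requested: bool = False


def reconcile(runtime: Runtime, referencing_jobs: List[str]) -> None:
    """Assumed rule: hold the finalizer while any TrainJob uses the runtime."""
    if referencing_jobs:
        runtime.finalizers.add(FINALIZER)
    else:
        runtime.finalizers.discard(FINALIZER)


def can_be_removed(runtime: Runtime) -> bool:
    """Kubernetes removes a deleted object only when no finalizers remain."""
    return runtime.deletion_requested and not runtime.finalizers


rt = Runtime("example-runtime")
reconcile(rt, ["job-a"])        # a TrainJob references the runtime
rt.deletion_requested = True    # a user asks to delete the runtime
blocked = not can_be_removed(rt)

reconcile(rt, [])               # the last referencing TrainJob is gone
removed = can_be_removed(rt)
print(blocked, removed)
```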

MPI Plugin

JobSet

New Examples

SDK Updates

Bug Fixes

v1.9.3

17 Jul 14:48
c77ee3f

This is the Training Operator v1.9.3 release.

New Features

Misc

v2.0.0-rc.1

05 Jul 23:52
Pre-release

This is the Kubeflow Trainer v2.0.0-rc.1 pre-release.

New Features

  • [release-2.0] feat: Add schedulingGates to PodSpecOverrides (#2705 by @astefanutti)
  • [release-2.0] feat: Mutable PodSpecOverrides for suspended TrainJob (#2698 by @astefanutti)
  • [Release 2.0] KEP-2170: Add the manifests overlay for Kubeflow Training V2 (#2692 by @Doris-xm)

Bug Fixes

  • [release-2.0] fix(module): Change Go module name to v2 (#2708 by @andreyvelich)
  • [cherry-pick] fix(manifests): Update manifests to enable LLM fine-tuning workflow w… (#2696 by @Electronic-Waste)
  • [release-2.0] fix(plugins): Fix some errors in torchtune mutation process. (#2693 by @Electronic-Waste)
  • [release-2.0] fix(rbac): Add required RBAC to update ClusterTrainingRuntimes on OpenShift (#2684 by @astefanutti)

Misc

  • [release-2.0] chore: Copy generated CRDs into Helm charts (#2704 by @astefanutti)
  • [cherry-pick] feat(example): Add alpaca-trianjob-yaml.ipynb. (#2670) (#2702 by @Electronic-Waste)
  • [release-2.0] chore: Replace the deprecated intstr.FromInt with intstr.FromInt32 (#2697 by @tenzen-y)
  • [release-2.0] chore: Remove the vendor specific parameters (#2694 by @tenzen-y)
  • [release-2.0] chore(runtime): Bump Torch to 2.7.1 and DeepSpeed to 0.17.1 (#2687 by @andreyvelich)
  • [release-2.0] chore(helm): Sync ClusterRule in Helm chart (#2688 by @astefanutti)

v2.0.0-rc.0

12 Jun 12:00
Pre-release

This is the Kubeflow Trainer v2.0.0-rc.0 pre-release.

Breaking Changes

New Features

LLM Trainer V2

Runtime Framework

  • feat(runtimes): Support MLX Distributed Runtime with OpenMPI (#2565 by @andreyvelich)
  • feat(runtimes): Support DeepSpeed Runtime with OpenMPI (#2559 by @andreyvelich)
  • feat(runtime): remove needless Launcher chainer. (#2558 by @IRONICBo)
  • Store the TrainingRuntime numNodes as runtime.Info.PodSet.Count (#2539 by @tenzen-y)
  • Add dependencies to RuntimeRegistrar (#2476 by @tenzen-y)
  • KEP: 2170: Adding cel validations on TrainingRuntime/ClusterTrainingRuntime CRDs (#2313 by @akshaychitneni)
  • Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to ClusterTrainingRuntime (#2625 by @tenzen-y)
  • Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to TrainingRuntime (#2608 by @tenzen-y)

MPI Plugin

JobSet

New Examples

SDK Updates

Bug Fixes

Misc

v1.9.2

03 May 02:43
bde9c20

This is the Training Operator v1.9.2 release.

New Features

Bug Fixes

v1.9.1 release

31 Mar 23:09
17077e3

This is the Training Operator v1.9.1 release.

Breaking Changes

New Features

  • Add volume and volume mounts arguments to TrainingClient.create_job API (#2449 by @astefanutti)
  • Add configurable QPS and burst settings for kube API client (#2411 by @ronk21runai)
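
QPS and burst are the two parameters of a token-bucket rate limiter, the kind commonly used by Kubernetes API clients. The sketch below illustrates what the two settings mean: `burst` caps how many requests can go out back to back, and `qps` is the steady refill rate. The `TokenBucket` class is hypothetical, and the setting names exposed by the operator are not shown in these notes, so none are assumed here.

```python
class TokenBucket:
    """Minimal token bucket with an explicit clock for deterministic runs."""

    def __init__(self, qps: float, burst: int, now: float = 0.0) -> None:
        self.qps = qps
        self.burst = burst
        self.tokens = float(burst)  # start full, so a burst is allowed at once
        self.last = now

    def allow(self, now: float) -> bool:
        """Refill for the elapsed time, then try to spend one token."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.qps)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(qps=5, burst=10)

# 15 requests at t=0: only the burst allowance passes.
immediate = sum(bucket.allow(now=0.0) for _ in range(15))

# 15 more requests one second later: one second of refill at 5 QPS passes.
after_one_second = sum(bucket.allow(now=1.0) for _ in range(15))

print(immediate, after_one_second)
```

Raising the burst helps with short spikes of API calls, while raising the QPS raises the sustained request rate.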

Bug Fixes

v1.9.0 release

28 Jan 15:58
6f74c7f

This is the Training Operator v1.9.0 release.

This release introduces a new JAXJob, enabling seamless distributed training with JAX.

Additionally, it adds the managedBy API to streamline the orchestration of training Jobs in multi-cluster environments using MultiKueue.

Breaking Changes

New Features

Distributed JAX

New Examples

Control Plane Updates

SDK Updates

Kubeflow Trainer V2

Bug Fixes

Misc

v1.9.0-rc.0 release

10 Jan 23:27
a0ae3b1
Pre-release

This is the Training Operator v1.9.0-rc.0 pre-release.

Breaking Changes

New Features

Distributed JAX

New Examples

Control Plane Updates

SDK Updates

Kubeflow Training V2

Bug Fixes

Misc

v1.8.1 release

10 Sep 15:14

This is the Training Operator v1.8.1 release.

Bug Fixes

  • [Bug] Finish CleanupJob early if the job is suspended (#2243 by @mszadkow)
  • [SDK] Fix trainer error: Update the version of base image and add "num_labels" for downloading pretrained models (#2230 by @helenxie-bit)
  • Update huggingface_hub Version in the storage initializer to fix ImportError (#2180 by @helenxie-bit)

New Contributors