@kaisoz

This is the Kubeflow Trainer v2.0.1 release.

New Features

[release-2.0] feat: Add a public function to create runtime info objects (#2846 by @kaisoz)

Bug Fixes

[release-2.0] fix(runtimes): Set numProcPerNode: 1 in DeepSpeed Runtime (#2863 by @andreyvelich)
[release-2.0] fix(ci): Add latest image tag only for the master branch (#2862 by @andreyvelich)
[release-2.0] fix: update examples to reflect func_args now being unpacked (#2815) (#2853 by @astefanutti)
[release-2.0] fix(examples): Update get_job_logs() API in examples (#2813) (#2852 by @astefanutti)
[release-2.0] feat(runtimes): Add Framework Label to the Runtimes (#2761) (#2851 by @astefanutti)
[release-2.0] fix(examples): Update the argument for Runtime framework (#2766) (#2850 by @astefanutti)
[release-2.0] fix: update kubeflow sdk reference (#2780) (#2847 by @astefanutti)
[release-2.0] fix(api): Fix license path for Kubeflow Trainer Python API (#2772 by @andreyvelich)

@eoinfennessy

This is the major release of the Kubeflow Trainer 2.0 project.

For more information, please see the

Quickstart

Install the Kubeflow Trainer control plane:

kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=v2.0.0"

Verify that the controller pods are running:

kubectl get pods -n kubeflow-system

NAME                                                  READY   STATUS    RESTARTS   AGE
jobset-controller-manager-54968bd57b-88dk4            2/2     Running   0          65s
kubeflow-trainer-controller-manager-cc6468559-dblnw   1/1     Running   0          65s

Install the Kubeflow Trainer runtimes:

kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/runtimes?ref=v2.0.0"

Install the Kubeflow Python SDK:

pip install git+https://github.com/kubeflow/sdk.git@main#subdirectory=python

Run your first TrainJob by following the getting started guide.
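The install commands above can be wrapped in a small script so that the release tag is set in one place. The following is a minimal sketch, not part of the project: the function names `trainer_overlay_url` and `install_trainer` are hypothetical, and it assumes that waiting for all Deployments in `kubeflow-system` to report `Available` is an acceptable substitute for checking the `kubectl get pods` output by hand.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with Kubeflow Trainer): builds the
# Kustomize remote URL for a given overlay ("manager" or "runtimes")
# and a given release tag such as v2.0.0.
trainer_overlay_url() {
  local overlay="$1" ref="$2"
  printf 'https://github.com/kubeflow/trainer.git/manifests/overlays/%s?ref=%s\n' "$overlay" "$ref"
}

# Hypothetical wrapper: installs the control plane, waits for its
# Deployments to become Available, then installs the runtimes.
# The default tag and the 300s timeout are illustrative assumptions.
install_trainer() {
  local ref="${1:-v2.0.0}"
  kubectl apply --server-side -k "$(trainer_overlay_url manager "$ref")"
  kubectl wait --for=condition=Available deployment --all \
    -n kubeflow-system --timeout=300s
  kubectl apply --server-side -k "$(trainer_overlay_url runtimes "$ref")"
}
```

Running `install_trainer v2.0.0` would apply the same two overlays as the commands above against the current kubectl context. Defining functions only, with no top-level commands, keeps the script safe to source without touching a cluster.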

Breaking Changes

Migrate SDK to the kubeflow/sdk repository (#2657 by @eoinfennessy)
KEP-2170: Change API Group Name to trainer.kubeflow.org (#2413 by @Electronic-Waste)
Move generated Python models into kubeflow_trainer_api package (#2632 by @kramaranya)
Upgrade kubernetes Go module version to 1.32 (#2450 by @tenzen-y)
Remove kubeflow-trainer prefix from jobset resource names (#2596 by @ChenYi015)
Remove the Training Operator V1 Source Code (#2389 by @andreyvelich)

New Features

LLM Trainer V2

KEP-2401: Support loading local LLMs (#2644 by @Electronic-Waste)
KEP-2401: Support mutating dataset preprocessing config in SDK (#2638 by @Electronic-Waste)
KEP-2401: Create LLM Training Runtimes for Llama 3.2 model family (#2590 by @Electronic-Waste)
KEP-2401: Complement torch plugin to support torchtune config mutation (#2587 by @Electronic-Waste)
KEP-2401: Create torchtune trainer image (#2516 by @Electronic-Waste)
KEP-2401: Refactor current train() API (#2513 by @Electronic-Waste)
KEP-2401: Kubeflow LLM Trainer V2 (#2410 by @Electronic-Waste)

Runtime Framework

feat(runtimes): Support MLX Distributed Runtime with OpenMPI (#2565 by @andreyvelich)
feat(runtimes): Support DeepSpeed Runtime with OpenMPI (#2559 by @andreyvelich)
feat(runtime): remove needless Launcher chainer. (#2558 by @IRONICBo)
Store the TrainingRuntime numNodes as runtime.Info.PodSet.Count (#2539 by @tenzen-y)
Add dependencies to RuntimeRegistrar (#2476 by @tenzen-y)
KEP: 2170: Adding cel validations on TrainingRuntime/ClusterTrainingRuntime CRDs (#2313 by @akshaychitneni)
Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to ClusterTrainingRuntime (#2625 by @tenzen-y)
Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to TrainingRuntime (#2608 by @tenzen-y)

MPI Plugin

[feature]:add validations for MPIRuntime with RunLauncherAsNode (#2551 by @Harshal292004)
Implement CustomValidation UT for MPI plugin (#2555 by @tenzen-y)
Implement MPI Plugin for OpenMPI (#2493 by @tenzen-y)
Implement MPI plugin UTs (#2481 by @tenzen-y)
Implement MPIImplementation Enum CRD validation (#2482 by @tenzen-y)
Implement MPI numProcPerNode defaulter (#2483 by @tenzen-y)
Add MPIMLPolicySource CRD defaulters (#2474 by @tenzen-y)
Make MPIMLPolicySource optional fields as a pointer (#2472 by @tenzen-y)
KEP-2170: Implement MPI Plugin for Kubeflow Trainer (#2394 by @andreyvelich)

JobSet

Retrieve JobSetSpec from runtime.Info in CustomValidations (#2557 by @tenzen-y)
KEP-2170: Deploy JobSet in kubeflow-system namespace (#2388 by @andreyvelich)
Bump JobSet to v0.8.0 (#2463 by @andreyvelich)
Upgrade jobset SDK version to v0.7.3 (#2445 by @Electronic-Waste)

New Examples

Add question-answer example for v2 trainer (#2580 by @solanyn)
KEP-2170: Add PyTorch DDP MNIST training example (#2387 by @astefanutti)

SDK Updates

feat(sdk): Get namespace from the provided context (#2593 by @andreyvelich)
feat(sdk): Support MPI-based TrainJobs (#2545 by @andreyvelich)
feat(sdk): Migrate to OpenAPI V3 (#2490 by @andreyvelich)
feat(sdk): Generate external Kubernetes and JobSet models (#2466 by @andreyvelich)

Bug Fixes

[release-2.0] fix(manifests): add rbac config of events for event recorders (#2733 by @rudeigerc)
[release-2.0] fix(manifests): fix position of labels of dataset-initializer from pod to job (#2720 by @rudeigerc)
[release-2.0] fix(module): Change Go module name to v2 (#2708 by @andreyvelich)
[cherry-pick] fix(manifests): Update manifests to enable LLM fine-tuning workflow w… (#2696 by @Electronic-Waste)
[release-2.0] fix(plugins): Fix some errors in torchtune mutation process. (#2693 by @Electronic-Waste)
[release-2.0] fix(rbac): Add required RBAC to update ClusterTrainingRuntimes on OpenShift (#2684 by @astefanutti)
Revert "fix(sdk): Fix type annotation for train method's trainer parameter" (#2651 by @Electronic-Waste)
fix(sdk): Fix bad arg passed to get_args_using_torchtune_config (#2647 by @eoinfennessy)
fix(sdk): Fix type annotation for train method's trainer parameter (#2646 by @eoinfennessy)
fix(controller): Fix RBAC permissions for TrainJob controller (#2626 by @andreyvelich)
Fix close-pr message in Stale GitHub Action (#2622 by @kramaranya)
fix: remove redundant K8s version matrix from integration tests (#2617 by @tr33k)
fix(doc): tidy up KEP-2401. (#2594 by @Electronic-Waste)
Fix MPI Test runnable errors (#2570 by @tenzen-y)
Fix issue with fetching clustertrainingruntime for validations (#2564 by @akshaychitneni)
fix(sdk): Add missing import types. (#2566 by @Electronic-Waste)
fix(sdk): Using correct entrypoint for mpirun (#2552 by @andreyvelich)
fix(sdk): add missing import type Initializer. (#2541 by @Electronic-Waste)
fix(ci): update test-go coverage ci config and replace trainer badge with new address. (#2534 by @IRONICBo)
fix(doc): Update train() API in KEP-2401 (#2536 by @Electronic-Waste)
fix(test): Update images for DockerHub publish (#2535 by @andreyvelich)
[hotfix] fix checkout on workflow (#2531 by @mahdikhashan)
[hotfix] fix docker cred (#2530 by @mahdikhashan)
fix: remove unused parameter name in default case of shouldUseCPU function (#2521 by @Diasker)
Fix #2407: Cap nproc_per_node based on CPU resources for PyTorch TrainJob (#2492 by @Diasker)
fix type in model initializer entrypoint ([#2489](https://github.com/kubeflow/trainer/pull...

@abhijeet-dhumal

This is the Training Operator v1.9.3 release.

New Features

[SDK] Add provision to provide local-queue for the training job (#2636 by @abhijeet-dhumal)

Misc

chore: Remove V2 code from Training Operator 1.9 release branch (#2737 by @andreyvelich)
chore(ci): Add more workaround no space left on device (#2677 by @astefanutti)

@astefanutti

This is the Kubeflow Trainer v2.0.0-rc.1 pre-release.

New Features

[release-2.0] feat: Add schedulingGates to PodSpecOverrides (#2705 by @astefanutti)
[release-2.0] feat: Mutable PodSpecOverrides for suspended TrainJob (#2698 by @astefanutti)
[Release 2.0] KEP-2170: Add the manifests overlay for Kubeflow Training V2 (#2692 by @Doris-xm)

Bug Fixes

[release-2.0] fix(module): Change Go module name to v2 (#2708 by @andreyvelich)
[cherry-pick] fix(manifests): Update manifests to enable LLM fine-tuning workflow w… (#2696 by @Electronic-Waste)
[release-2.0] fix(plugins): Fix some errors in torchtune mutation process. (#2693 by @Electronic-Waste)
[release-2.0] fix(rbac): Add required RBAC to update ClusterTrainingRuntimes on OpenShift (#2684 by @astefanutti)

Misc

[release-2.0] chore: Copy generated CRDs into Helm charts (#2704 by @astefanutti)
[cherry-pick] feat(example): Add alpaca-trianjob-yaml.ipynb. (#2670) (#2702 by @Electronic-Waste)
[release-2.0] chore: Replace the deprecated intstr.FromInt with intstr.FromInt32 (#2697 by @tenzen-y)
[release-2.0] chore: Remove the vendor specific parameters (#2694 by @tenzen-y)
[release-2.0] chore(runtime): Bump Torch to 2.7.1 and DeepSpeed to 0.17.1 (#2687 by @andreyvelich)
[release-2.0] chore(helm): Sync ClusterRule in Helm chart (#2688 by @astefanutti)

@Electronic-Waste

This is the Kubeflow Trainer v2.0.0-rc.0 pre-release.

Breaking Changes

KEP-2170: Change API Group Name to trainer.kubeflow.org (#2413 by @Electronic-Waste)
Move generated Python models into kubeflow_trainer_api package (#2632 by @kramaranya)
Upgrade kubernetes Go module version to 1.32 (#2450 by @tenzen-y)
Remove kubeflow-trainer prefix from jobset resource names (#2596 by @ChenYi015)
Remove the Training Operator V1 Source Code (#2389 by @andreyvelich)

New Features

LLM Trainer V2

KEP-2401: Support loading local LLMs (#2644 by @Electronic-Waste)
KEP-2401: Support mutating dataset preprocessing config in SDK (#2638 by @Electronic-Waste)
KEP-2401: Create LLM Training Runtimes for Llama 3.2 model family (#2590 by @Electronic-Waste)
KEP-2401: Complement torch plugin to support torchtune config mutation (#2587 by @Electronic-Waste)
KEP-2401: Create torchtune trainer image (#2516 by @Electronic-Waste)
KEP-2401: Refactor current train() API (#2513 by @Electronic-Waste)
KEP-2401: Kubeflow LLM Trainer V2 (#2410 by @Electronic-Waste)

Runtime Framework

feat(runtimes): Support MLX Distributed Runtime with OpenMPI (#2565 by @andreyvelich)
feat(runtimes): Support DeepSpeed Runtime with OpenMPI (#2559 by @andreyvelich)
feat(runtime): remove needless Launcher chainer. (#2558 by @IRONICBo)
Store the TrainingRuntime numNodes as runtime.Info.PodSet.Count (#2539 by @tenzen-y)
Add dependencies to RuntimeRegistrar (#2476 by @tenzen-y)
KEP: 2170: Adding cel validations on TrainingRuntime/ClusterTrainingRuntime CRDs (#2313 by @akshaychitneni)
Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to ClusterTrainingRuntime (#2625 by @tenzen-y)
Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to TrainingRuntime (#2608 by @tenzen-y)

MPI Plugin

[feature]:add validations for MPIRuntime with RunLauncherAsNode (#2551 by @Harshal292004)
Implement CustomValidation UT for MPI plugin (#2555 by @tenzen-y)
Implement MPI Plugin for OpenMPI (#2493 by @tenzen-y)
Implement MPI plugin UTs (#2481 by @tenzen-y)
Implement MPIImplementation Enum CRD validation (#2482 by @tenzen-y)
Implement MPI numProcPerNode defaulter (#2483 by @tenzen-y)
Add MPIMLPolicySource CRD defaulters (#2474 by @tenzen-y)
Make MPIMLPolicySource optional fields as a pointer (#2472 by @tenzen-y)
KEP-2170: Implement MPI Plugin for Kubeflow Trainer (#2394 by @andreyvelich)

JobSet

Retrieve JobSetSpec from runtime.Info in CustomValidations (#2557 by @tenzen-y)
KEP-2170: Deploy JobSet in kubeflow-system namespace (#2388 by @andreyvelich)
Bump JobSet to v0.8.0 (#2463 by @andreyvelich)
Upgrade jobset SDK version to v0.7.3 (#2445 by @Electronic-Waste)

New Examples

Add question-answer example for v2 trainer (#2580 by @solanyn)
KEP-2170: Add PyTorch DDP MNIST training example (#2387 by @astefanutti)

SDK Updates

Remove SDK (#2657 by @eoinfennessy)
feat(sdk): Get namespace from the provided context (#2593 by @andreyvelich)
feat(sdk): Support MPI-based TrainJobs (#2545 by @andreyvelich)
feat(sdk): Migrate to OpenAPI V3 (#2490 by @andreyvelich)
feat(sdk): Generate external Kubernetes and JobSet models (#2466 by @andreyvelich)

Bug Fixes

Revert "fix(sdk): Fix type annotation for train method's trainer parameter" (#2651 by @Electronic-Waste)
fix(sdk): Fix bad arg passed to get_args_using_torchtune_config (#2647 by @eoinfennessy)
fix(sdk): Fix type annotation for train method's trainer parameter (#2646 by @eoinfennessy)
fix(controller): Fix RBAC permissions for TrainJob controller (#2626 by @andreyvelich)
Fix close-pr message in Stale GitHub Action (#2622 by @kramaranya)
fix: remove redundant K8s version matrix from integration tests (#2617 by @tr33k)
fix(doc): tidy up KEP-2401. (#2594 by @Electronic-Waste)
Fix MPI Test runnable errors (#2570 by @tenzen-y)
Fix issue with fetching clustertrainingruntime for validations (#2564 by @akshaychitneni)
fix(sdk): Add missing import types. (#2566 by @Electronic-Waste)
fix(sdk): Using correct entrypoint for mpirun (#2552 by @andreyvelich)
fix(sdk): add missing import type Initializer. (#2541 by @Electronic-Waste)
fix(ci): update test-go coverage ci config and replace trainer badge with new address. (#2534 by @IRONICBo)
fix(doc): Update train() API in KEP-2401 (#2536 by @Electronic-Waste)
fix(test): Update images for DockerHub publish (#2535 by @andreyvelich)
[hotfix] fix checkout on workflow (#2531 by @mahdikhashan)
[hotfix] fix docker cred (#2530 by @mahdikhashan)
fix: remove unused parameter name in default case of shouldUseCPU function (#2521 by @Diasker)
Fix #2407: Cap nproc_per_node based on CPU resources for PyTorch TrainJob (#2492 by @Diasker)
fix type in model initializer entrypoint (#2489 by @szaher)
fix(runtime): fix error label name. (#2487 by @Electronic-Waste)
fix(sdk): resolve errors in deserialization (#2457 by @Electronic-Waste)
Fix missing external types in apply configurations (#2429 by @astefanutti)
Fix API Group for Torch Runtime (#2424 by @andreyvelich)
Fix Kustomize patchesStrategicMerge deprecation warning (#2405 by @astefanutti)
ControlPlane: Fix flaky integration testings due to missing the latest version of object (#2414 by @tenzen-y)

Misc

Tag Docker images with GitHub release tags (#2662 by @kramaranya)
feat(controller): Implement PodSpecOverride API (#2614 by @andreyvelich)
Nominate @Electronic-Waste as approver and @astefanutti as reviewer (#2659 by @andreyvelich)
chore(build): Support Podman to run OpenAPI generator (#2656 by @astefanutti)
chore(docs): Add OpenSSF Best Practices Badge (#2611 by @andreyvelich)
[chore] update stale action version to latest (#2642 by @mahdikhashan)
Remove TrainJobCreated condition (#2621 by @astefanutti)
ci: refactor build-push-images workflow (#2607 by @milinddethe15)
Update Go to v1.24 (#2615) (#2620 by @vzamboulingame)
test(runtime): add UT for IndexTrainJobTrainingRuntime (#2603 by @Harshal292004)
ci: add k8s v1.32 for tests env ([#2613](#26...

@abhijeet-dhumal

This is the Training Operator v1.9.2 release.

New Features

Add provision to provide labels and annotations for the pytorchjob an… (#2612 by @abhijeet-dhumal)

Bug Fixes

Fix llm hp optimization error (#2576 by @helenxie-bit)
[bug] pull image from ghcr (#2584 by @mahdikhashan)

@saileshd1402

This is the Training Operator v1.9.1 release.

Breaking Changes

Update Manifest Images to GHCR (#2544 by @saileshd1402)
Push images to GHCR for release-1.9 (#2491 by @saileshd1402)

New Features

Add volume and volume mounts arguments to TrainingClient.create_job API (#2449 by @astefanutti)
Add configurable QPS and burst settings for kube API client (#2411 by @ronk21runai)

Bug Fixes

fix(ci): Change publish dir from training to trainer (#2546 by @Electronic-Waste)
fix: fix typos in script comments. (#2465 by @IRONICBo)
fix: adds jaxjobs to the kubeflow-training-roles.yaml ClusterRole (#2417 by @DnPlas)
[release-1.9] Rename paddlepaddle_defaults.go file name (#2400 by @ChristianZaccaria)

@astefanutti

This is the Training Operator v1.9.0 release.

This release introduces a new JAXJob, enabling seamless distributed training with JAX.

Additionally, it adds the managedBy API to streamline the orchestration of training Jobs in multi-cluster environments using MultiKueue.

Breaking Changes

Upgrade Kubernetes to v1.31.3 (#2330 by @astefanutti)
Upgrade Kubernetes to v1.30.7 (#2332 by @astefanutti)
Update the name of PVC in train API (#2187 by @helenxie-bit)
Remove support for MXJob (#2150 by @tariq-hasan)
Support Python 3.11 and Drop Python 3.7 (#2105 by @tenzen-y)

New Features

Distributed JAX

Add JAX controller (#2194 by @sandipanpanda)
Add JAX API (#2163 by @sandipanpanda)
JAX Integration Enhancement Proposal (#2125 by @sandipanpanda)
JAX example for MNIST SPMD and add CI testing (#2390 by @saileshd1402)

New Examples

FSDP Example for T5 Fine-Tuning and PyTorchJob (#2286 by @andreyvelich)
Add DeepSpeed Example with Pytorch Operator (#2235 by @Syulin7)

Control Plane Updates

Validate pytorchjob workers are configured when elasticpolicy is configured (#2320 by @tarat44)
[Feature] Support managed by external controller (#2203 by @mszadkow)
Update trainer to ensure type consistency for train_args and lora_config (#2181 by @helenxie-bit)
Support ARM64 platform in TensorFlow examples (#2119 by @akhilsaivenkata)
Feat: Support ARM64 platform in XGBoost examples (#2114 by @tico88612)
ARM64 supported in PyTorch examples (#2116 by @danielsuh05)

SDK Updates

[SDK] Adding env vars (#2285 by @tarekabouzeid)
[SDK] Use torchrun to create PyTorchJob from function (#2276 by @andreyvelich)
[SDK] move env var to constants.py (#2268 by @varshaprasad96)
[SDK] Allow customising base trainer and storage images in Train API (#2261 by @varshaprasad96)
[SDK] Read namespace from the current context (#2255 by @andreyvelich)
[SDK] Sync Transformers version for train API (#2146 by @andreyvelich)
[SDK] Explain Python version support cycle (#2144 by @andreyvelich)

Kubeflow Trainer V2

KEP-2170: Kubeflow Training V2 API (#2171 by @andreyvelich)
KEP-2170: Update V2 KEP with MPI Runtime info (#2345 by @andreyvelich)
Always update TrainJob status on errors (#2352 by @astefanutti)
Fix TrainJob status comparison and update (#2353 by @astefanutti)
Add required RBAC on TrainJob finalizer sub-resources (#2350 by @astefanutti)
KEP-2170: [SDK] Initial implementation of the Kubeflow Training V2 Python SDK (#2324 by @andreyvelich)
KEP-2170: Add Torch Distributed Runtime (#2328 by @andreyvelich)
KEP-2170: Add TrainJob conditions (#2322 by @tenzen-y)
KEP-2170: Add the TrainJob state transition design (#2298 by @tenzen-y)
KEP-2170: Implement Initializer builders in the JobSet plugin (#2316 by @andreyvelich)
KEP-2170: Implement JobSet, PlainML, and Torch Plugins (#2308 by @andreyvelich)
KEP-2170: Create model and dataset initializers (#2303 by @andreyvelich)
KEP-2170: Generate Python SDK for Kubeflow Training V2 (#2310 by @andreyvelich)
KEP-2170: Initialize runtimes before the manager starts (#2306 by @tenzen-y)
KEP-2170: Strictly verify the CRD marker validation and defaulting in the integration testings (#2304 by @tenzen-y)
KEP-2170: Decouple JobSet from TrainJob (#2296 by @tenzen-y)
KEP-2170: Implement TrainJob Reconciler to manage objects (#2295 by @tenzen-y)
KEP-2170: Add manifests for Kubeflow Training V2 (#2289 by @andreyvelich)
KEP-2170: Adding CEL validations on v2 TrainJob CRD (#2260 by @akshaychitneni)
KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API (#2283 by @andreyvelich)
KEP-2170: Implement runtime framework (#2248 by @tenzen-y)
[v2alpha] Move GV related codebase (#2281 by @varshaprasad96)
KEP-2170: Generate clientset, openapi spec for the V2 APIs (#2273 by @varshaprasad96)
KEP-2170: Implement skeleton webhook servers (#2251 by @tenzen-y)
KEP-2170: Initial Implementations for v2 Manager (#2236 by @tenzen-y)
KEP-2170: Generate CRD manifests for v2 CustomResources (#2237 by @tenzen-y)
KEP-2170: Update Training V2 APIs in the KEP (#2240 by @andreyvelich)
KEP-2170: Add TrainJob and TrainingRuntime APIs (#2223 by @andreyvelich)
KEP-2170: Bind repository into the build environment instead of filecopy (#2222 by @tenzen-y)
KEP-2170: Add directories for the V2 APIs (#2221 by @andreyvelich)
KEP-2170: Add the apiGroup to the TrainingRuntimeRef (#2201 by @tenzen-y)
KEP-2170: Make API specification more restricting (#2198 by @tenzen-y)

Bug Fixes

[release-1.9] V1: Fix versions in HuggingFace dataset initializer (#2370 by @andreyvelich)
Pin accelerate package version in trainer (#2340 by @gavrissh)
[fix] Resolve v2alpha API exceptions (#2317 by @varshaprasad96)
[SDK] Minor fix in wait_for_job_conditions with job_kind python training API (#2265 by @saileshd1402)
[SDK] Fix typo of "get_pvc_spec" (#2250 by @helenxie-bit)
[Bug] Finish CleanupJob early if the job is suspended. (#2243 by @mszadkow)
[SDK] Fix trainer error: Update the version of base image and add "num_labels" for downloading pretrained models (#2230 by @helenxie-bit)
Update huggingface_hub Version in the storage initializer to fix ImportError (#2180 by @helenxie-bit)
[SDK] Fix Failed condition in wait Job API (#2160 by @andreyvelich)
fix volcano podgroup update issue (#2079 by @ckyuto)
[SDK] Fix Incorrect Events in get_job_logs API (#2122 by @andreyvelich)

Misc

[release-1.9] Add release branch to the image push trigger (#2377 by @andreyvelich)
Add e2e test for train API (#2199 by @helenxie-bit)
buildx link was broken ([#2356](https://github.com/kubeflow/training-operator/pul...

@astefanutti

This is the Training Operator v1.9.0-rc.0 pre-release.

Breaking Changes

Upgrade Kubernetes to v1.31.3 (#2330 by @astefanutti)
Upgrade Kubernetes to v1.30.7 (#2332 by @astefanutti)
Update the name of PVC in train API (#2187 by @helenxie-bit)
Remove support for MXJob (#2150 by @tariq-hasan)
Support Python 3.11 and Drop Python 3.7 (#2105 by @tenzen-y)

New Features

Distributed JAX

Add JAX controller (#2194 by @sandipanpanda)
Add JAX API (#2163 by @sandipanpanda)
JAX Integration Enhancement Proposal (#2125 by @sandipanpanda)

New Examples

FSDP Example for T5 Fine-Tuning and PyTorchJob (#2286 by @andreyvelich)
Add DeepSpeed Example with Pytorch Operator (#2235 by @Syulin7)

Control Plane Updates

Validate pytorchjob workers are configured when elasticpolicy is configured (#2320 by @tarat44)
[Feature] Support managed by external controller (#2203 by @mszadkow)
Update trainer to ensure type consistency for train_args and lora_config (#2181 by @helenxie-bit)
Support ARM64 platform in TensorFlow examples (#2119 by @akhilsaivenkata)
Feat: Support ARM64 platform in XGBoost examples (#2114 by @tico88612)
ARM64 supported in PyTorch examples (#2116 by @danielsuh05)

SDK Updates

[SDK] Adding env vars (#2285 by @tarekabouzeid)
[SDK] Use torchrun to create PyTorchJob from function (#2276 by @andreyvelich)
[SDK] move env var to constants.py (#2268 by @varshaprasad96)
[SDK] Allow customising base trainer and storage images in Train API (#2261 by @varshaprasad96)
[SDK] Read namespace from the current context (#2255 by @andreyvelich)
[SDK] Sync Transformers version for train API (#2146 by @andreyvelich)
[SDK] Explain Python version support cycle (#2144 by @andreyvelich)

Kubeflow Training V2

KEP-2170: Kubeflow Training V2 API (#2171 by @andreyvelich)
KEP-2170: Update V2 KEP with MPI Runtime info (#2345 by @andreyvelich)
Always update TrainJob status on errors (#2352 by @astefanutti)
Fix TrainJob status comparison and update (#2353 by @astefanutti)
Add required RBAC on TrainJob finalizer sub-resources (#2350 by @astefanutti)
KEP-2170: [SDK] Initial implementation of the Kubeflow Training V2 Python SDK (#2324 by @andreyvelich)
KEP-2170: Add Torch Distributed Runtime (#2328 by @andreyvelich)
KEP-2170: Add TrainJob conditions (#2322 by @tenzen-y)
KEP-2170: Add the TrainJob state transition design (#2298 by @tenzen-y)
KEP-2170: Implement Initializer builders in the JobSet plugin (#2316 by @andreyvelich)
KEP-2170: Implement JobSet, PlainML, and Torch Plugins (#2308 by @andreyvelich)
KEP-2170: Create model and dataset initializers (#2303 by @andreyvelich)
KEP-2170: Generate Python SDK for Kubeflow Training V2 (#2310 by @andreyvelich)
KEP-2170: Initialize runtimes before the manager starts (#2306 by @tenzen-y)
KEP-2170: Strictly verify the CRD marker validation and defaulting in the integration testings (#2304 by @tenzen-y)
KEP-2170: Decouple JobSet from TrainJob (#2296 by @tenzen-y)
KEP-2170: Implement TrainJob Reconciler to manage objects (#2295 by @tenzen-y)
KEP-2170: Add manifests for Kubeflow Training V2 (#2289 by @andreyvelich)
KEP-2170: Adding CEL validations on v2 TrainJob CRD (#2260 by @akshaychitneni)
KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API (#2283 by @andreyvelich)
KEP-2170: Implement runtime framework (#2248 by @tenzen-y)
[v2alpha] Move GV related codebase (#2281 by @varshaprasad96)
KEP-2170: Generate clientset, openapi spec for the V2 APIs (#2273 by @varshaprasad96)
KEP-2170: Implement skeleton webhook servers (#2251 by @tenzen-y)
KEP-2170: Initial Implementations for v2 Manager (#2236 by @tenzen-y)
KEP-2170: Generate CRD manifests for v2 CustomResources (#2237 by @tenzen-y)
KEP-2170: Update Training V2 APIs in the KEP (#2240 by @andreyvelich)
KEP-2170: Add TrainJob and TrainingRuntime APIs (#2223 by @andreyvelich)
KEP-2170: Bind repository into the build environment instead of filecopy (#2222 by @tenzen-y)
KEP-2170: Add directories for the V2 APIs (#2221 by @andreyvelich)
KEP-2170: Add the apiGroup to the TrainingRuntimeRef (#2201 by @tenzen-y)
KEP-2170: Make API specification more restricting (#2198 by @tenzen-y)

Bug Fixes

[release-1.9] V1: Fix versions in HuggingFace dataset initializer (#2370 by @andreyvelich)
Pin accelerate package version in trainer (#2340 by @gavrissh)
[fix] Resolve v2alpha API exceptions (#2317 by @varshaprasad96)
[SDK] Minor fix in wait_for_job_conditions with job_kind python training API (#2265 by @saileshd1402)
[SDK] Fix typo of "get_pvc_spec" (#2250 by @helenxie-bit)
[Bug] Finish CleanupJob early if the job is suspended. (#2243 by @mszadkow)
[SDK] Fix trainer error: Update the version of base image and add "num_labels" for downloading pretrained models (#2230 by @helenxie-bit)
Update huggingface_hub Version in the storage initializer to fix ImportError (#2180 by @helenxie-bit)
[SDK] Fix Failed condition in wait Job API (#2160 by @andreyvelich)
fix volcano podgroup update issue (#2079 by @ckyuto)
[SDK] Fix Incorrect Events in get_job_logs API (#2122 by @andreyvelich)

Misc

[release-1.9] Add release branch to the image push trigger (#2377 by @andreyvelich)
Add e2e test for train API (#2199 by @helenxie-bit)
buildx link was broken (#2356 by @Veer0x1)
Upgrade helm/kind-action to v1.11.0 (#2357 by @astefanutti)
Upgrade Go version to v1.23 (#2302 by @tenzen-y)
Ensure code generation dependencies are downloaded (#2339 by @astefanutti)
Added test for create-pytorchjob.ipynb python notebook ([#2274](https://github.com/kubeflow/training-operator...

@mszadkow

This is the Training Operator v1.8.1 release.

Bug Fixes

[Bug] Finish CleanupJob early if the job is suspended (#2243 by @mszadkow)
[SDK] Fix trainer error: Update the version of base image and add "num_labels" for downloading pretrained models (#2230 by @helenxie-bit)
Update huggingface_hub Version in the storage initializer to fix ImportError (#2180 by @helenxie-bit)

New Contributors

@mszadkow made their first contribution in #2243
@helenxie-bit made their first contribution in #2180

Releases: kubeflow/trainer
