First, you should install the requirements:
$ pip install -U -r requirements.txt
The following example shows supervised LoRA fine-tuning of the Qwen2 model on a conversational-format dataset:
python sft.py \
--model_name_or_path "Qwen/Qwen2-7B" \
--dataset_name "philschmid/dolly-15k-oai-style" \
--streaming False \
--bf16 True \
--subset '' \
--output_dir ./model_qwen \
--num_train_epochs 1 \
--per_device_train_batch_size 16 \
--eval_strategy "no" \
--save_strategy "no" \
--learning_rate 3e-4 \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--max_grad_norm 0.3 \
--logging_steps 1 \
--do_train \
--do_eval \
--use_habana \
--use_lazy_mode \
--throughput_warmup_steps 3 \
--use_peft True \
--lora_r 4 \
--lora_alpha=16 \
--lora_dropout=0.05 \
--lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj" \
--max_seq_length 512 \
--adam_epsilon 1e-08 \
--use_flash_attention
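The dataset above is in the conversational (OpenAI chat) format, where each record is a list of role/content messages. A minimal sketch of inspecting one record (the "messages" column name is an assumption; adjust it if the dataset uses a different one):
from datasets import load_dataset

# Peek at one record of the conversational dataset used above.
# Assumption: the chat turns live in a "messages" column of
# {"role": ..., "content": ...} dicts.
dataset = load_dataset("philschmid/dolly-15k-oai-style", split="train")
for message in dataset[0]["messages"]:
    print(message["role"], ":", message["content"][:80])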
Supervised fine-tuning of the mistralai/Mixtral-8x7B-Instruct-v0.1 on 4 cards:
DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 python ../gaudi_spawn.py --world_size 4 --use_deepspeed sft.py \
--model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1 \
--dataset_name "philschmid/dolly-15k-oai-style" \
--subset 'data/' \
--streaming False \
--deepspeed ../language-modeling/llama2_ds_zero3_config.json \
--output_dir="./model_mixtral" \
--do_train \
--max_steps=500 \
--logging_steps=10 \
--save_steps=100 \
--per_device_train_batch_size=2 \
--per_device_eval_batch_size=1 \
--gradient_accumulation_steps=2 \
--learning_rate=1e-4 \
--lr_scheduler_type="cosine" \
--warmup_steps=100 \
--weight_decay=0.05 \
--optim="paged_adamw_32bit" \
--lora_target_modules "q_proj" "v_proj" \
--bf16 \
--remove_unused_columns=False \
--max_seq_length 512 \
--run_name="sft_mixtral" \
--report_to=none \
--use_habana \
--use_lazy_mode
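The --deepspeed flag above points to the ZeRO-3 configuration shipped with the language-modeling examples. If you need to adapt it, a minimal ZeRO-3 configuration looks roughly like the sketch below (the fields are assumptions; refer to ../language-modeling/llama2_ds_zero3_config.json for the actual file):
import json

# Rough sketch of a minimal DeepSpeed ZeRO-3 config written out as JSON.
zero3_config = {
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": False,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}
with open("my_ds_zero3_config.json", "w") as f:
    json.dump(zero3_config, f, indent=2)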
The following example walks through the creation of StackLlama 2, a Stack Exchange llama-v2-70b model. There are two main steps to the DPO training process.
For a large model like Llama2-70B, we can use DeepSpeed ZeRO-3 to enable DPO training across multiple cards. The steps are as follows:
- Supervised fine-tuning of the base llama-v2-70b model to create llama-v2-70b-se:
DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 python ../gaudi_spawn.py --world_size 8 --use_deepspeed sft.py \
--model_name_or_path meta-llama/Llama-2-70b-hf \
--dataset_name "lvwerra/stack-exchange-paired" \
--deepspeed ../language-modeling/llama2_ds_zero3_config.json \
--output_dir="./sft" \
--do_train \
--max_steps=500 \
--logging_steps=10 \
--save_steps=100 \
--per_device_train_batch_size=1 \
--per_device_eval_batch_size=1 \
--gradient_accumulation_steps=2 \
--learning_rate=1e-4 \
--lr_scheduler_type="cosine" \
--warmup_steps=100 \
--weight_decay=0.05 \
--optim="paged_adamw_32bit" \
--lora_target_modules "q_proj" "v_proj" \
--bf16 \
--remove_unused_columns=False \
--run_name="sft_llama2" \
--report_to=none \
--use_habana \
--use_lazy_mode
To merge the adapters and get the final merged SFT checkpoint, we can use the merge_peft_adapter.py helper script that comes with TRL:
python merge_peft_adapter.py --base_model_name="meta-llama/Llama-2-70b-hf" --adapter_model_name="sft" --output_name="sft/final_merged_checkpoint"
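Under the hood, merging simply folds the LoRA weights back into the base model. A minimal sketch of the equivalent operation with PEFT directly (not the actual merge_peft_adapter.py implementation, just the underlying calls):
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model, attach the LoRA adapter trained by sft.py, fold the
# adapter weights into the base weights, then save a standalone checkpoint.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-70b-hf", torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, "sft").merge_and_unload()
merged.save_pretrained("sft/final_merged_checkpoint")
AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf").save_pretrained("sft/final_merged_checkpoint")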
- Run the DPO trainer using the model saved by the previous step:
DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 python ../gaudi_spawn.py --world_size 8 --use_deepspeed dpo.py \
--model_name_or_path="sft/final_merged_checkpoint" \
--tokenizer_name_or_path=meta-llama/Llama-2-70b-hf \
--deepspeed ../language-modeling/llama2_ds_zero3_config.json \
--lora_target_modules "q_proj" "v_proj" "k_proj" "out_proj" "fc_in" "fc_out" "wte" \
--output_dir="dpo" \
--max_prompt_length=256 \
--max_length=512 \
--report_to=none
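As a reminder of what the DPO trainer optimizes: it increases the log-probability margin of the chosen response over the rejected one, measured relative to the frozen SFT reference model. A minimal sketch of the objective (beta is the usual DPO temperature; the value below is illustrative):
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # -log sigmoid(beta * (policy margin minus the same margin
    # computed under the frozen reference model))
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()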
To merge the adapters into the base model, we can use the merge_peft_adapter.py helper script that comes with TRL:
python merge_peft_adapter.py --base_model_name="meta-llama/Llama-2-70b-hf" --adapter_model_name="dpo" --output_name="stack-llama-2"
This will also push the model to your Hugging Face Hub account.
We can load the DPO-trained LoRA adapters which were saved by the DPO training step and run them through the text-generation example:
python ../gaudi_spawn.py --world_size 8 --use_deepspeed run_generation.py \
--model_name_or_path ../trl/stack-llama-2/ \
--use_hpu_graphs --use_kv_cache --batch_size 1 --bf16 --max_new_tokens 100 \
--prompt "Here is my prompt"
The following example walks through the creation of StackLlama 2, a Stack Exchange llama-v2-7b model. There are three main steps to the PPO training process:
- Supervised fine-tuning of the base llama-v2-7b model to create llama-v2-7b-se:
python ../gaudi_spawn.py --world_size 8 --use_mpi sft.py \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--dataset_name "lvwerra/stack-exchange-paired" \
--output_dir="./sft" \
--do_train \
--max_steps=500 \
--logging_steps=10 \
--save_steps=100 \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=1 \
--gradient_accumulation_steps=2 \
--learning_rate=1e-4 \
--lr_scheduler_type="cosine" \
--warmup_steps=100 \
--weight_decay=0.05 \
--optim="paged_adamw_32bit" \
--lora_target_modules "q_proj" "v_proj" \
--bf16 \
--remove_unused_columns=False \
--run_name="sft_llama2" \
--report_to=none \
--use_habana \
--use_lazy_mode
To merge the adapters and get the final merged SFT checkpoint, we can use the merge_peft_adapter.py helper script that comes with TRL:
python merge_peft_adapter.py --base_model_name="meta-llama/Llama-2-7b-hf" --adapter_model_name="sft" --output_name="sft/final_merged_checkpoint"
- Reward modeling using dialog pairs from the SE dataset on the llama-v2-7b-se model to create llama-v2-7b-se-rm:
python ../gaudi_spawn.py --world_size 8 --use_mpi reward_modeling.py \
--model_name_or_path=./sft/final_merged_checkpoint \
--tokenizer_name_or_path=meta-llama/Llama-2-7b-hf \
--output_dir=./rm
To merge the adapters into the base model, we can use the merge_peft_adapter.py helper script that comes with TRL:
python merge_peft_adapter.py --base_model_name="meta-llama/Llama-2-7b-hf" --adapter_model_name="rm" --output_name="rm_merged_checkpoint"
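The merged reward model is a sequence-classification model that maps a question/answer pair to a single scalar score, which PPO later uses as the reward. A minimal sketch of scoring a candidate answer with it (the prompt template below is an assumption and may differ from the one used during reward training):
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Score a single question/answer pair with the merged reward model.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "rm_merged_checkpoint", num_labels=1, torch_dtype=torch.bfloat16
)
text = "Question: How do I reverse a list in Python?\n\nAnswer: Use reversed(my_list) or my_list[::-1]."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    reward = reward_model(**inputs).logits[0].item()
print(f"reward: {reward:.4f}")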
- RL fine-tuning of llama-v2-7b-se with the llama-v2-7b-se-rm reward model:
python ../gaudi_spawn.py --world_size 8 --use_mpi ppo.py \
--model_name_or_path=./sft/final_merged_checkpoint \
--reward_model_name=./rm_merged_checkpoint \
--tokenizer_name_or_path=meta-llama/Llama-2-7b-hf \
--adafactor=False \
--output_max_length=128 \
--batch_size=8 \
--gradient_accumulation_steps=8 \
--batched_gen=True \
--ppo_epochs=4 \
--seed=0 \
--learning_rate=1.4e-5 \
--early_stopping=True \
--output_dir=llama-se-rl-finetune
To merge the adapters into the base model, we can use the merge_peft_adapter.py helper script that comes with TRL:
python merge_peft_adapter.py --base_model_name="meta-llama/Llama-2-7b-hf" --adapter_model_name="llama-se-rl-finetune" --output_name="rl_merged_checkpoint"
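Note that the quantity PPO actually optimizes in this step is typically not the raw reward-model score alone: RLHF-style PPO subtracts a KL penalty that keeps the policy close to the SFT model it started from. A rough sketch of that combined signal (the coefficient is illustrative, not a flag of ppo.py):
def kl_penalized_reward(rm_score, policy_logprob, ref_logprob, kl_coef=0.2):
    # Reward-model score minus a penalty for drifting away from the
    # SFT reference policy (measured by the log-probability gap).
    return rm_score - kl_coef * (policy_logprob - ref_logprob)

# Example: a high score can be offset by a large divergence from the reference.
print(kl_penalized_reward(rm_score=2.0, policy_logprob=-10.0, ref_logprob=-14.0))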
We can load the PPO-trained LoRA adapters which were saved by the PPO training step and run them through the text-generation example:
python run_generation.py \
--model_name_or_path ../trl/rl_merged_checkpoint/ \
--use_hpu_graphs --use_kv_cache --batch_size 1 --bf16 --max_new_tokens 100 \
--prompt "Here is my prompt"
The following example is for fine-tuning Stable Diffusion using Denoising Diffusion Policy Optimization (DDPO). The implementation supports both LoRA and non-LoRA training. LoRA-based training is faster and less finicky to converge than non-LoRA-based training. Recommendations for non-LoRA-based training (described here) are to set the learning rate relatively low (e.g., 1e-5) and to disable mixed-precision training. HPU graphs are enabled by default for better performance.
There are two main steps to the DDPO training process:
- Fine-tuning of the base stable-diffusion model with LoRA to create ddpo-aesthetic-predictor:
python ddpo.py \
--num_epochs=200 \
--train_gradient_accumulation_steps=1 \
--sample_num_steps=50 \
--sample_batch_size=6 \
--train_batch_size=3 \
--sample_num_batches_per_epoch=4 \
--per_prompt_stat_tracking=True \
--per_prompt_stat_tracking_buffer_size=32 \
--train_learning_rate=1e-05 \
--tracker_project_name="stable_diffusion_training" \
--log_with="tensorboard" \
--use_habana \
--use_hpu_graphs \
--bf16 \
--hf_hub_model_id="ddpo-finetuned-stable-diffusion" \
--push_to_hub False \
--sdp_on_bf16
Note: Due to a known issue on Gaudi3, sample_batch_size should be changed to 3. The issue will be fixed in a future release.
- Inference using the fine-tuned LoRA weights as shown in the example below:
import torch
from optimum.habana import GaudiConfig
from optimum.habana.trl import GaudiDefaultDDPOStableDiffusionPipeline
gaudi_config = GaudiConfig.from_pretrained("Habana/stable-diffusion")
model_id = "CompVis/stable-diffusion-v1-4"
lora_model_id = "ddpo-finetuned-stable-diffusion"
pipeline = GaudiDefaultDDPOStableDiffusionPipeline(
model_id,
use_habana=True,
use_hpu_graphs=True,
gaudi_config=gaudi_config,
)
pipeline.sd_pipeline.load_lora_weights(lora_model_id)
device = torch.device("hpu")
# memory optimization
pipeline.vae.to(device, torch.bfloat16)
pipeline.text_encoder.to(device, torch.bfloat16)
pipeline.unet.to(device, torch.bfloat16)
prompts = ["lion", "squirrel", "crab", "starfish", "whale", "sponge", "plankton"]
results = pipeline(prompts)
for prompt, image in zip(prompts, results.images):
    image.save(f"{prompt}.png")