avoid saving boft_P in adapter model by sywangyi · Pull Request #2050 · huggingface/peft · GitHub

avoid saving boft_P in adapter model #2050


Merged · 6 commits merged into huggingface:main on Sep 13, 2024

Conversation

@sywangyi (Contributor) commented on Sep 5, 2024

No description provided.

@sywangyi (Contributor, Author) commented on Sep 5, 2024

You could use https://github.com/huggingface/peft/blob/main/examples/boft_dreambooth/train_dreambooth.sh to reproduce. Without the PR, the unet adapter size is 485995200 B; with the PR, the adapter size drops to 54387152 B.
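To see where the checkpoint size goes, you can inspect the per-tensor sizes in the saved adapter file. A minimal sketch (the file path is hypothetical and depends on your output directory):

from safetensors.torch import load_file

state_dict = load_file("output/unet/adapter_model.safetensors")  # hypothetical path
for key, tensor in sorted(state_dict.items(), key=lambda kv: -kv[1].numel())[:5]:
    # print the five largest tensors and their size in bytes
    print(key, tensor.numel() * tensor.element_size(), "bytes")

Before the fix, the boft_P entries dominate this listing.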

The training params from the output are:
trainable params: 13,611,904 || all params: 879,497,668 || trainable%: 1.5477

The training precision is float32.

@sywangyi (Contributor, Author) commented on Sep 5, 2024

@BenjaminBossan please help review

@BenjaminBossan (Member) left a comment

Great catch, thanks for providing a fix. Indeed, I think that the boft_P buffer need not be saved. Pinging @Zeju1997 to kindly ask to check this as well.
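For context, the fix amounts to registering the buffer as non-persistent, which excludes it from the state_dict while still letting it move with the module across devices. A minimal sketch of the pattern (an illustrative toy module, not the actual PEFT code):

import torch
import torch.nn as nn


class ToyBOFTLayer(nn.Module):
    # hypothetical toy layer demonstrating the persistent=False pattern
    def __init__(self):
        super().__init__()
        # trainable parameters end up in the state_dict as usual
        self.weight = nn.Parameter(torch.randn(4, 4))
        # a deterministic helper tensor (like the boft_P permutation matrix)
        # can be recomputed at init time, so it need not be checkpointed
        self.register_buffer("boft_P", torch.eye(4), persistent=False)


layer = ToyBOFTLayer()
assert "boft_P" not in layer.state_dict()  # excluded from saved checkpoints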

Let's also add a unit test for this. I tested locally using this:

import torch
from peft import BOFTConfig, PeftModel, get_peft_model
from safetensors.torch import load_file
from transformers import AutoModelForCausalLM


class TestBoft:
    def test_boft_state_dict(self, tmp_path):
        # see #2050
        # ensure that the boft_P buffer is not stored in the checkpoint file and is not necessary to load the model
        # correctly
        torch.manual_seed(0)

        inputs = torch.arange(10).view(-1, 1).to(0)
        model_id = "hf-internal-testing/tiny-random-OPTForCausalLM"
        model = AutoModelForCausalLM.from_pretrained(model_id).to(0)
        model.eval()
        output_base = model(inputs).logits

        config = BOFTConfig(init_weights=False)
        model = get_peft_model(model, config)
        model.eval()
        output_peft = model(inputs).logits

        atol, rtol = 1e-9, 1e-9
        # sanity check: loading boft changed the output
        assert not torch.allclose(output_base, output_peft, atol=atol, rtol=rtol)

        model.save_pretrained(tmp_path)
        del model

        # check that the boft_P buffer is not present
        state_dict = load_file(tmp_path / "adapter_model.safetensors")
        assert not any("boft_P" in key for key in state_dict)

        # sanity check: the model still produces the same output after loading
        model = AutoModelForCausalLM.from_pretrained(model_id).to(0)
        model = PeftModel.from_pretrained(model, tmp_path)
        output_loaded = model(inputs).logits
        assert torch.allclose(output_peft, output_loaded, atol=atol, rtol=rtol)

If you agree, just create a new file tests/test_boft.py and add it there (with the copyright notice on top).

One thing I was concerned about was whether this would cause trouble loading old checkpoints that still contain the boft_P buffer. It should not matter, but just to be sure, I created such a checkpoint and wrote a test to check that this checkpoint can be successfully loaded. This test can be added below the previous one.

    def test_boft_old_checkpoint_including_boft_P(self, tmp_path):
        # see #2050
        # This test exists to ensure that after the boft_P buffer was made non-persistent, old checkpoints can still be
        # loaded successfully.
        torch.manual_seed(0)

        inputs = torch.arange(10).view(-1, 1).to(0)
        model_id = "hf-internal-testing/tiny-random-OPTForCausalLM"
        model = AutoModelForCausalLM.from_pretrained(model_id).to(0)

        # first create the expected output
        config = BOFTConfig(init_weights=False)
        model = get_peft_model(model, config)
        model.eval()
        output_peft = model(inputs).logits
        del model

        model = AutoModelForCausalLM.from_pretrained(model_id).to(0)
        # checkpoint from before the PR whose state_dict still contains boft_P
        hub_id = "peft-internal-testing/boft-tiny-opt-peft-v0.12"
        model = PeftModel.from_pretrained(model, hub_id)
        output_old = model(inputs).logits

        atol, rtol = 1e-9, 1e-9
        assert torch.allclose(output_peft, output_old, atol=atol, rtol=rtol)

@HuggingFaceDocBuilderDev commented
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@BenjaminBossan (Member) left a comment

Thanks for adding the tests, I only have some minor comments. Could you please also run make style with a recent ruff version? Apart from that, let's wait a bit to see if Zeju wants to add anything, otherwise I'll merge as is.

@BenjaminBossan (Member) left a comment

Sorry @sywangyi, I forgot to make the test device agnostic. With the suggested changes, the CI should pass.
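For illustration, a common way to make such tests device agnostic (a sketch of the general pattern, not necessarily the exact suggestions made here; it assumes the torch_device helper from transformers.testing_utils) is to resolve the device once instead of hardcoding .to(0):

from transformers.testing_utils import torch_device  # resolves to "cuda", "mps", or "cpu"

inputs = torch.arange(10).view(-1, 1).to(torch_device)
model = AutoModelForCausalLM.from_pretrained(model_id).to(torch_device)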

sywangyi and others added 2 commits September 6, 2024 06:45
@BenjaminBossan (Member) commented

@sywangyi For some reason, the tests fail on MacOS. I don't have a machine to test this on, but looking at the logs, the tensors look identical, yet the torch.allclose call returns False:

FAILED tests/test_boft.py::TestBoft::test_boft_old_checkpoint_including_boft_P - assert False
 +  where False = <built-in method allclose of type object at 0x10e63f150>(
tensor([
[[ 0.2096,  0.0000, -0.0209,  ...,  0.1637, -0.1071,  0.0516]],\n\n        
[[ 0.0901,  0.0000,  0.0214,  ...,  0...0301]],\n\n        
[[ 0.0866,  0.0000,  0.0419,  ...,  0.1571, -0.1145, -0.0144]]
],\n       grad_fn=<UnsafeViewBackward0>), tensor([
[[ 0.2096,  0.0000, -0.0209,  ...,  0.1637, -0.1071,  0.0516]],\n\n        
[[ 0.0901,  0.0000,  0.0214,  ...,  0..., -0.0659,  ...,  0.0609, -0.0228,  0.0301]],\n\n        
[[ 0.0866,  0.0000,  0.0419,  ...,  0.1571, -0.1145, -0.0144]]]), atol=1e-09, rtol=1e-09)
 +    where <built-in method allclose of type object at 0x10e63f150> = torch.allclose

(I added some line breaks for readability)

Probably it's enough to loosen the tolerance; 1e-9 really isn't necessary.
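For example (illustrative values, not necessarily what was finally committed), tolerances on the order of float32 precision still make for a strict check:

atol, rtol = 1e-6, 1e-6  # loose enough to absorb platform-specific float32 rounding
assert torch.allclose(output_peft, output_old, atol=atol, rtol=rtol)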

Other than that, I'd wait until the end of the week to see if Zeju replies, otherwise I'll merge even without their approval.

@BenjaminBossan (Member) left a comment

Thanks for this improvement, LGTM.

We didn't get a response, so I'll just go ahead and merge this.

@BenjaminBossan merged commit 2520227 into huggingface:main on Sep 13, 2024
14 checks passed
BenjaminBossan pushed a commit to BenjaminBossan/peft that referenced this pull request Sep 18, 2024
The buffer does not need to be part of the checkpoint; by making it non-persistent, the file size can be greatly reduced.
Guy-Bilitski pushed a commit to Guy-Bilitski/peft that referenced this pull request May 13, 2025
The buffer does not need to be part of the checkpoint; by making it non-persistent, the file size can be greatly reduced.