Fine-tuning the 13B model fails in safe_save_model_for_hf_trainer #3217
@Oscarjia

Description

Fine-tuning the 13B model raises the exception shown below, whereas the 7B model trains and saves without problems. Can anyone help me resolve this?

Could this be caused by insufficient memory or disk space?

Have you run into this issue before, @congchan?
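
For reference, my understanding of the helper named in the traceback below is roughly the following. This is a simplified sketch for context, not the exact FastChat source:

```python
# Simplified approximation of the helper referenced in the traceback
# (fastchat/train/train_with_template.py, line 75); not the exact source.
def safe_save_model_for_hf_trainer(trainer, output_dir: str):
    """Collect the model state dict on CPU, then let the Trainer write it to disk."""
    state_dict = trainer.model.state_dict()  # tensors may still live on GPU
    if trainer.args.should_save:
        # The "CUDA error: invalid argument" is raised inside this copy to CPU.
        cpu_state_dict = {key: value.cpu() for key, value in state_dict.items()}
        del state_dict
        trainer._save(output_dir, state_dict=cpu_state_dict)
```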

File "FastChat/fastchat/train/train_with_template.py", line 75, in safe_save_model_for_hf_trainer
    cpu_state_dict = {key: value.cpu() for key, value in state_dict.items()}
  File "FastChat/fastchat/train/train_with_template.py", line 75, in <dictcomp>
    cpu_state_dict = {key: value.cpu() for key, value in state_dict.items()}
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 761) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
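
One workaround that may be worth trying (an assumption on my side, not a confirmed fix): if the run uses FSDP, gather the full state dict with CPU offload and `rank0_only` inside `FSDP.state_dict_type(...)` instead of calling `.cpu()` on the gathered GPU tensors afterwards, so the full 13B weights are never materialized on a single GPU. The function name below is hypothetical; `trainer._save` is the same private Trainer method used in the sketch above.

```python
# Hedged sketch of a possible mitigation for an FSDP run, not a confirmed fix.
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    FullStateDictConfig,
    StateDictType,
)

def save_fsdp_model_on_rank0(trainer, output_dir: str):
    """Gather a full state dict, offloaded to CPU shard by shard, on rank 0 only."""
    save_policy = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(
        trainer.model, StateDictType.FULL_STATE_DICT, save_policy
    ):
        cpu_state_dict = trainer.model.state_dict()
    if trainer.args.should_save:
        trainer._save(output_dir, state_dict=cpu_state_dict)
```

If the run is not using FSDP, it would also help to check free GPU memory and disk space at save time and to rerun with `CUDA_LAUNCH_BLOCKING=1`, since, as the error message notes, asynchronous CUDA errors can be attributed to the wrong call.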
