Closed
Description
Fine-tuning the 13B model fails with the exception shown below, whereas the 7B model trains without problems. Can anyone help me resolve this?
Could this be caused by not having enough space?
Have you run into this issue before, @congchan?
File "FastChat/fastchat/train/train_with_template.py", line 75, in safe_save_model_for_hf_trainer
cpu_state_dict = {key: value.cpu() for key, value in state_dict.items()}
File "FastChat/fastchat/train/train_with_template.py", line 75, in <dictcomp>
cpu_state_dict = {key: value.cpu() for key, value in state_dict.items()}
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 761) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
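To narrow down where the copy fails, the dict comprehension from the traceback can be unrolled into a loop that reports the offending key. This is only a debugging sketch, not a fix: `state_dict_to_cpu` is a hypothetical helper name, and it assumes each value exposes a `.cpu()` method as in the original line. Because CUDA errors can be reported asynchronously, the key it names is only trustworthy when the run is started with `CUDA_LAUNCH_BLOCKING=1`, as the error message suggests.

```python
def state_dict_to_cpu(state_dict):
    """Hypothetical helper: copy entries to CPU one at a time.

    Does the same work as the dict comprehension in the traceback, but
    re-raises with the name of the entry whose .cpu() call failed.
    """
    cpu_state_dict = {}
    for key, value in state_dict.items():
        try:
            cpu_state_dict[key] = value.cpu()
        except RuntimeError as err:
            # Attach the key so the log shows which entry triggered the error.
            raise RuntimeError(f"failed to move {key!r} to CPU: {err}") from err
    return cpu_state_dict
```

In `safe_save_model_for_hf_trainer`, the line `cpu_state_dict = {key: value.cpu() for key, value in state_dict.items()}` could be swapped for `cpu_state_dict = state_dict_to_cpu(state_dict)` while debugging.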