Closed
Description
Bug summary
Parallel training with the PyTorch backend fails with a CUDA out-of-memory (OOM) error during the neighbor statistics step.
DeePMD-kit Version
v3.0.1
Backend and its version
PyTorch v2.4.1.post302
How did you download the software?
Offline packages
Input Files, Running Commands, Error Log, etc.
py child error file (/tmp/torchelastic_mkfvemdy/none_9h49vpn4/attempt_0/4/error.json)
Traceback (most recent call last):
File "/root/deepmd-kit/bin/torchrun", line 10, in <module>
sys.exit(main())
^^^^^^
File "/root/deepmd-kit/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/root/deepmd-kit/lib/python3.12/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/root/deepmd-kit/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/root/deepmd-kit/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/deepmd-kit/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
dp FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-02-10_17:56:48
host : bohrium-156-1256408
rank : 4 (local_rank: 4)
exitcode : 1 (pid: 1349)
error_file: /tmp/torchelastic_mkfvemdy/none_9h49vpn4/attempt_0/4/error.json
traceback : Traceback (most recent call last):
File "/root/deepmd-kit/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/entrypoints/main.py", line 527, in main
train(
File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/entrypoints/main.py", line 317, in train
config["model"], min_nbor_dist = BaseModel.update_sel(
^^^^^^^^^^^^^^^^^^^^^
File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/dpmodel/model/base_model.py", line 192, in update_sel
return cls.update_sel(train_data, type_map, local_jdata)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/model/dp_model.py", line 45, in update_sel
local_jdata_cpy["descriptor"], min_nbor_dist = BaseDescriptor.update_sel(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/dpmodel/descriptor/make_base_descriptor.py", line 238, in update_sel
return cls.update_sel(train_data, type_map, local_jdata)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/descriptor/dpa1.py", line 739, in update_sel
min_nbor_dist, sel = UpdateSel().update_one_sel(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/utils/update_sel.py", line 33, in update_one_sel
min_nbor_dist, tmp_sel = self.get_nbor_stat(
^^^^^^^^^^^^^^^^^^^
File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/utils/update_sel.py", line 122, in get_nbor_stat
min_nbor_dist, max_nbor_size = neistat.get_stat(train_data)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/utils/neighbor_stat.py", line 66, in get_stat
for mn, dt, jj in self.iterator(data):
^^^^^^^^^^^^^^^^^^^
File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/utils/neighbor_stat.py", line 159, in iterator
minrr2, max_nnei = self.auto_batch_size.execute_all(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/utils/batch_size.py", line 197, in execute_all
n_batch, result = self.execute(execute_with_batch_size, index, natoms)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/utils/batch_size.py", line 111, in execute
raise e
File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/utils/batch_size.py", line 108, in execute
n_batch, result = callable(max(batch_nframes, 1), start_index)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/utils/batch_size.py", line 174, in execute_with_batch_size
return (end_index - start_index), callable(
^^^^^^^^^
File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/utils/neighbor_stat.py", line 186, in _execute
minrr2, max_nnei = self.op(
^^^^^^^^
File "/root/deepmd-kit/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/deepmd-kit/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
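For readers following the traceback: it passes through `execute_all` and `execute` in `deepmd/utils/batch_size.py`, and `execute` re-raises the `RuntimeError` whose text is `CUDA error: out of memory`. The sketch below illustrates the general idea of halving a batch size when an error looks like an OOM. It is not DeePMD-kit's implementation, and whether the library recognizes this particular message is not established by this report. All names (`OOM_MARKERS`, `looks_like_oom`, `run_all`, `fake_kernel`) are hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical marker strings; the second is the message seen in the log above.
OOM_MARKERS = ("CUDA out of memory", "CUDA error: out of memory")


def looks_like_oom(err: Exception) -> bool:
    """Return True if the exception is a RuntimeError carrying a known OOM message."""
    return isinstance(err, RuntimeError) and any(m in str(err) for m in OOM_MARKERS)


def run_all(
    fn: Callable[[int, int], List[int]],
    total: int,
    batch_size: int,
) -> Tuple[List[int], int]:
    """Process `total` frames in batches, halving the batch size on OOM.

    `fn(start, end)` handles frames [start, end) and returns a list of results.
    Returns all results plus the batch size that finally worked.
    """
    results: List[int] = []
    start = 0
    while start < total:
        end = min(start + batch_size, total)
        try:
            results.extend(fn(start, end))
        except Exception as err:
            # Re-raise anything that is not an OOM, or an OOM at batch size 1.
            if not looks_like_oom(err) or batch_size == 1:
                raise
            batch_size = max(batch_size // 2, 1)
            continue  # retry the same frames with the smaller batch
        start = end
    return results, batch_size


def fake_kernel(start: int, end: int) -> List[int]:
    """Stand-in for a GPU op: pretends more than 2 frames do not fit in memory."""
    if end - start > 2:
        raise RuntimeError("CUDA error: out of memory")
    return list(range(start, end))


if __name__ == "__main__":
    print(run_all(fake_kernel, total=5, batch_size=8))
```

If the message check in such a wrapper misses a given error string, the error propagates unchanged instead of triggering a retry, which is one way an OOM can surface from inside a batch-size helper.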
Steps to Reproduce
cd examples/water/se_atten
torchrun --nproc_per_node=4 --no-python dp --pt train input.json
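Since the error message itself suggests `CUDA_LAUNCH_BLOCKING=1`, the same command can be rerun with that variable set to get a more reliable stack trace. A minimal sketch, assuming the command above is launched from `examples/water/se_atten`; the helper name `build_debug_run` is hypothetical, and the snippet only builds and prints the command rather than launching it.

```python
import os
import shlex
from typing import Dict, List, Tuple


def build_debug_run(nproc_per_node: int = 4) -> Tuple[List[str], Dict[str, str]]:
    """Return the reproduction command and an environment with synchronous CUDA launches."""
    cmd = [
        "torchrun",
        f"--nproc_per_node={nproc_per_node}",
        "--no-python",
        "dp", "--pt", "train", "input.json",
    ]
    env = dict(os.environ)
    # Suggested by the error message: report kernel errors at the failing call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return cmd, env


cmd, env = build_debug_run()
print("CUDA_LAUNCH_BLOCKING=" + env["CUDA_LAUNCH_BLOCKING"], shlex.join(cmd))
# To actually launch it (not done here):
#   import subprocess
#   subprocess.run(cmd, env=env, cwd="examples/water/se_atten", check=True)
```

Synchronous launches slow training noticeably, so this setting is meant for the diagnostic run only.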
Further Information, Files, and Links
No response