[BUG] PT parallel training neighbor stat OOM #4594
@njzjz

Description

Bug summary

Parallel training with the PyTorch backend throws a CUDA out-of-memory (OOM) error during the neighbor-statistics step.
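
A plausible reading of the failure (an assumption; the log alone does not prove it): under torchrun --nproc_per_node=4, every rank runs the neighbor-statistics pass itself, so several processes allocate large distance tensors on the same GPU at once. A minimal sketch of a rank-0 gate that would avoid the duplicated work; broadcast_object_list is real torch.distributed API, while compute_stat is a hypothetical stand-in for the neighbor-stat computation:

import torch.distributed as dist

def neighbor_stat_once(compute_stat):
    """Run the expensive neighbor-stat pass on rank 0 only and share the
    result, instead of letting every rank allocate GPU memory for it.

    compute_stat is a hypothetical callable standing in for deepmd's
    neighbor-stat computation; this is a sketch, not deepmd's code.
    """
    if not (dist.is_available() and dist.is_initialized()):
        return compute_stat()  # serial run: nothing to coordinate
    result = [compute_stat()] if dist.get_rank() == 0 else [None]
    dist.broadcast_object_list(result, src=0)  # ship rank 0's result to all ranks
    return result[0]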

DeePMD-kit Version

v3.0.1

Backend and its version

PyTorch v2.4.1.post302

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

child error file (/tmp/torchelastic_mkfvemdy/none_9h49vpn4/attempt_0/4/error.json)
Traceback (most recent call last):
  File "/root/deepmd-kit/bin/torchrun", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/deepmd-kit/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/root/deepmd-kit/lib/python3.12/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/root/deepmd-kit/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/root/deepmd-kit/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/deepmd-kit/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
dp FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-02-10_17:56:48
  host      : bohrium-156-1256408
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 1349)
  error_file: /tmp/torchelastic_mkfvemdy/none_9h49vpn4/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/root/deepmd-kit/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
      return f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/entrypoints/main.py", line 527, in main
      train(
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/entrypoints/main.py", line 317, in train
      config["model"], min_nbor_dist = BaseModel.update_sel(
                                       ^^^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/dpmodel/model/base_model.py", line 192, in update_sel
      return cls.update_sel(train_data, type_map, local_jdata)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/model/dp_model.py", line 45, in update_sel
      local_jdata_cpy["descriptor"], min_nbor_dist = BaseDescriptor.update_sel(
                                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/dpmodel/descriptor/make_base_descriptor.py", line 238, in update_sel
      return cls.update_sel(train_data, type_map, local_jdata)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/descriptor/dpa1.py", line 739, in update_sel
      min_nbor_dist, sel = UpdateSel().update_one_sel(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/utils/update_sel.py", line 33, in update_one_sel
      min_nbor_dist, tmp_sel = self.get_nbor_stat(
                               ^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/utils/update_sel.py", line 122, in get_nbor_stat
      min_nbor_dist, max_nbor_size = neistat.get_stat(train_data)
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/utils/neighbor_stat.py", line 66, in get_stat
      for mn, dt, jj in self.iterator(data):
                        ^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/utils/neighbor_stat.py", line 159, in iterator
      minrr2, max_nnei = self.auto_batch_size.execute_all(
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/utils/batch_size.py", line 197, in execute_all
      n_batch, result = self.execute(execute_with_batch_size, index, natoms)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/utils/batch_size.py", line 111, in execute
      raise e
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/utils/batch_size.py", line 108, in execute
      n_batch, result = callable(max(batch_nframes, 1), start_index)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/utils/batch_size.py", line 174, in execute_with_batch_size
      return (end_index - start_index), callable(
                                        ^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/utils/neighbor_stat.py", line 186, in _execute
      minrr2, max_nnei = self.op(
                         ^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
      return self._call_impl(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
      return forward_call(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  RuntimeError: CUDA error: out of memory
  CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
  For debugging consider passing CUDA_LAUNCH_BLOCKING=1
  Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
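
The frames at deepmd/utils/batch_size.py show deepmd's auto-batch-size wrapper, which is meant to catch CUDA OOM and shrink the batch until it fits; here the allocation fails outright, presumably because the other ranks already hold the memory, so the retry path re-raises. A minimal sketch of the general pattern (hypothetical helper names, not deepmd's exact implementation):

import torch

def run_with_adaptive_batch(fn, n_frames, init_batch=1024):
    """Call fn(start, end) over n_frames frames, halving the batch size
    whenever CUDA reports out-of-memory. Sketch of the pattern that
    deepmd/utils/batch_size.py implements, under assumed names."""
    batch, start, results = init_batch, 0, []
    while start < n_frames:
        end = min(start + batch, n_frames)
        try:
            results.append(fn(start, end))
            start = end
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
            if batch == 1:
                raise  # even a single frame does not fit: give up, as seen above
            batch = max(batch // 2, 1)
    return results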

Steps to Reproduce

cd examples/water/se_atten
torchrun --nproc_per_node=4 --no-python dp --pt train input.json
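
A hedged workaround sketch, assuming the installed dp supports the --skip-neighbor-stat training flag and that sel is given explicit values in input.json rather than "auto": skipping the neighbor-statistics pass avoids the offending allocation entirely.

torchrun --nproc_per_node=4 --no-python dp --pt train input.json --skip-neighbor-stat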

Further Information, Files, and Links

No response
