@@ -346,29 +346,45 @@ complete snapshot of the memory allocator state via
:meth:`~torch.cuda.memory_snapshot`, which can help you understand the
underlying allocation patterns produced by your code.

+ .. _cuda-memory-envvars:
+
+ Environment variables
+ ^^^^^^^^^^^^^^^^^^^^^
+
Use of a caching allocator can interfere with memory checking tools such as
``cuda-memcheck``. To debug memory errors using ``cuda-memcheck``, set
``PYTORCH_NO_CUDA_MEMORY_CACHING=1`` in your environment to disable caching.

- The behavior of caching allocator can be controlled via environment variable
+ The behavior of the caching allocator can be controlled via the environment variable
``PYTORCH_CUDA_ALLOC_CONF``.
The format is ``PYTORCH_CUDA_ALLOC_CONF=<option>:<value>,<option2>:<value2>...``
Available options:

- * ``max_split_size_mb`` prevents the allocator from splitting blocks larger
- than this size (in MB). This can help prevent fragmentation and may allow
- some borderline workloads to complete without running out of memory.
- Performance cost can range from 'zero' to 'substatial' depending on
- allocation patterns. Default value is unlimited, i.e. all blocks can be
- split. The :meth:`~torch.cuda.memory_stats` and
- :meth:`~torch.cuda.memory_summary` methods are useful for tuning. This
- option should be used as a last resort for a workload that is aborting
- due to 'out of memory' and showing a large amount of inactive split blocks.
* ``backend`` allows selecting the underlying allocator implementation.
Currently, valid options are ``native``, which uses PyTorch's native
implementation, and ``cudaMallocAsync``, which uses
`CUDA's built-in asynchronous allocator`_.
``cudaMallocAsync`` requires CUDA 11.4 or newer. The default is ``native``.
+ * ``max_split_size_mb`` prevents the native allocator
+ from splitting blocks larger than this size (in MB). This can reduce
+ fragmentation and may allow some borderline workloads to complete without
+ running out of memory. Performance cost can range from 'zero' to 'substantial'
+ depending on allocation patterns. Default value is unlimited, i.e. all blocks
+ can be split. The
+ :meth:`~torch.cuda.memory_stats` and
+ :meth:`~torch.cuda.memory_summary` methods are useful for tuning. This
+ option should be used as a last resort for a workload that is aborting
+ due to 'out of memory' and showing a large amount of inactive split blocks.
+ ``max_split_size_mb`` is only meaningful with ``backend:native``.
+ With ``backend:cudaMallocAsync``, ``max_split_size_mb`` is ignored.
+
+ .. note::
+
+ Some stats reported by the
+ :ref:`CUDA memory management API<cuda-memory-management-api>`
+ are specific to ``backend:native``, and are not meaningful with
+ ``backend:cudaMallocAsync``.
+ See each function's docstring for details.

.. _CUDA's built-in asynchronous allocator:
https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-1/
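
For illustration only (not part of the change above): a minimal sketch of how the
``PYTORCH_CUDA_ALLOC_CONF`` variable documented here might be exercised from Python.
The option values shown (``max_split_size_mb:128``, ``backend:native``) are arbitrary
placeholders, not recommendations; the variable just needs to be in place before the
caching allocator is first used, so setting it before importing ``torch`` is the
safest choice.

.. code-block:: python

    import os

    # Configure the caching allocator before PyTorch initializes it.
    # The values below are illustrative, not recommendations.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128,backend:native"

    import torch

    if torch.cuda.is_available():
        x = torch.empty(1024, 1024, device="cuda")  # first CUDA allocation
        # memory_stats()/memory_summary() help when tuning max_split_size_mb
        # (only meaningful with the native backend).
        print(torch.cuda.memory_summary())

Setting the variable in the shell before launching the process works just as well,
e.g. ``PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python script.py`` (the script
name is a placeholder), and ``PYTORCH_NO_CUDA_MEMORY_CACHING=1`` is set the same way
when running under ``cuda-memcheck``.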