CUDA error: an illegal memory access was encountered (with large prompts) #13851
Closed
@ko-alex

Name and Version

./build/bin/llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: yes
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
version: 5518 (26b79b6)
built with cc (Debian 12.2.0-14+deb12u1) 12.2.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-cli

Command line

llama-cli -t 4 --flash-attn --color --conversation --multiline-input --mirostat 2 --tensor-split 1.2,0.8 --ctx-size $((8192*20)) --n-gpu-layers 66 --temp 0.9 -m /work/models/misc/gemma/gemma-3-12B-it-Q6_KLA.gguf

Problem description & steps to reproduce

A large prompt (above roughly 16k tokens) consistently fails with the error below:

  • paste a prompt longer than ~16k tokens (a scripted reproduction is sketched after this list)
  • the process crashes with the CUDA error shown in the log output below
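
A non-interactive reproduction along these lines should trigger the same crash. This is only a sketch: it assumes llama-cli's -f/--file flag for reading the prompt from a file, and the filler text and repetition count are illustrative, chosen just to push the prompt well past 16k tokens.

  # build a throwaway prompt well above 16k tokens
  yes "The quick brown fox jumps over the lazy dog." | head -n 2500 > prompt.txt
  # same flags as the command line above, minus the interactive ones
  ./build/bin/llama-cli -t 4 --flash-attn --mirostat 2 --tensor-split 1.2,0.8 \
      --ctx-size $((8192*20)) --n-gpu-layers 66 --temp 0.9 \
      -m /work/models/misc/gemma/gemma-3-12B-it-Q6_KLA.gguf \
      -f prompt.txt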

First Bad Commit

952f395
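
To double-check the regression boundary, a checkout-and-rebuild sketch like the following can be used, assuming a plain CMake CUDA build (-DGGML_CUDA=ON); rerun the reproduction above after each build:

  # parent of the suspected commit: expected to handle >16k prompts
  git checkout 952f395~1
  cmake -B build -DGGML_CUDA=ON && cmake --build build -j
  # suspected first bad commit: expected to crash
  git checkout 952f395
  cmake -B build -DGGML_CUDA=ON && cmake --build build -j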

Relevant log output

work/src/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:75: CUDA error
CUDA error: an illegal memory access was encountered
  current device: 1, in function ggml_backend_cuda_synchronize at /work/src/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2461
  cudaStreamSynchronize(cuda_ctx->stream())
[New LWP 69906]
[New LWP 69913]
[New LWP 69914]
[New LWP 69915]
[New LWP 69916]
[New LWP 69917]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fe36ccf2c17 in wait4 () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x00007fe36ccf2c17 in wait4 () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fe36d1b81c5 in ggml_abort () from /work/src/llama.cpp/build/bin/libggml-base.so
#2  0x00007fe36a4b75b3 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /work/src/llama.cpp/build/bin/libggml-cuda.so
#3  0x00007fe36a4b8aea in ggml_backend_cuda_synchronize(ggml_backend*) () from /work/src/llama.cpp/build/bin/libggml-cuda.so
#4  0x00007fe36d1c61c6 in ggml_backend_sched_synchronize () from /work/src/llama.cpp/build/bin/libggml-base.so
#5  0x00007fe36d2f20c0 in llama_context::synchronize() () from /work/src/llama.cpp/build/bin/libllama.so
#6  0x00007fe36d2f3b50 in llama_get_logits_ith () from /work/src/llama.cpp/build/bin/libllama.so
#7  0x000055adf0f2aaa3 in common_sampler_sample(common_sampler*, llama_context*, int, bool) ()
#8  0x000055adf0e1dcaa in main ()

compute-sanitizer reports a few thousand errors like the following:

========= Invalid __global__ read of size 4 bytes
=========     at void k_get_rows_float<float, float>(const T1 *, const int *, T2 *, long, long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long)+0x490
=========     by thread (32,0,0) in block (0,0,0)
=========     Address 0x9dc06ca7e680 is out of bounds
=========     and is 32,958,472,250,497 bytes after the nearest allocation at 0x7fc6aea00000 of size 512 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame:  [0x381497] in libcuda.so.1
=========         Host Frame:  [0x13e88] in libcudart.so.12
=========         Host Frame: cudaLaunchKernel [0x79f87] in libcudart.so.12
=========         Host Frame: get_rows_cuda(void const*, ggml_type, int const*, void*, ggml_type, long, unsigned long, unsigned long, unsigned long, long, long, long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, CUstream_st*) [0xb3224] in libggml-cuda.so
=========         Host Frame: ggml_cuda_op_get_rows(ggml_backend_cuda_context&, ggml_tensor*) [0xb65b0] in libggml-cuda.so
=========         Host Frame: ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0xc398a] in libggml-cuda.so
=========         Host Frame: ggml_backend_sched_graph_compute_async [0x27c92] in libggml-base.so
=========         Host Frame: llama_context::graph_compute(ggml_cgraph*, bool) [0x72a28] in libllama.so
=========         Host Frame: llama_context::decode(llama_batch&) [0x77175] in libllama.so
=========         Host Frame: llama_decode [0x78522] in libllama.so
=========         Host Frame: main [0xbf913] in llama-cli
=========
========= Invalid __global__ read of size 4 bytes
=========     at void k_get_rows_float<float, float>(const T1 *, const int *, T2 *, long, long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long)+0x490
=========     by thread (33,0,0) in block (0,0,0)
=========     Address 0x9dc06ca7e684 is out of bounds
=========     and is 32,958,472,250,501 bytes after the nearest allocation at 0x7fc6aea00000 of size 512 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame:  [0x381497] in libcuda.so.1
=========         Host Frame:  [0x13e88] in libcudart.so.12
=========         Host Frame: cudaLaunchKernel [0x79f87] in libcudart.so.12
=========         Host Frame: get_rows_cuda(void const*, ggml_type, int const*, void*, ggml_type, long, unsigned long, unsigned long, unsigned long, long, long, long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, CUstream_st*) [0xb3224] in libggml-cuda.so
=========         Host Frame: ggml_cuda_op_get_rows(ggml_backend_cuda_context&, ggml_tensor*) [0xb65b0] in libggml-cuda.so
=========         Host Frame: ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0xc398a] in libggml-cuda.so
=========         Host Frame: ggml_backend_sched_graph_compute_async [0x27c92] in libggml-base.so
=========         Host Frame: llama_context::graph_compute(ggml_cgraph*, bool) [0x72a28] in libllama.so
=========         Host Frame: llama_context::decode(llama_batch&) [0x77175] in libllama.so
=========         Host Frame: llama_decode [0x78522] in libllama.so
=========         Host Frame: main [0xbf913] in llama-cli
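
For reference, output like the above can be collected with an invocation roughly like this (a sketch; memcheck is compute-sanitizer's default tool, and the remaining flags mirror the command line section):

  compute-sanitizer --tool memcheck \
      ./build/bin/llama-cli -t 4 --flash-attn --tensor-split 1.2,0.8 \
      --ctx-size $((8192*20)) --n-gpu-layers 66 \
      -m /work/models/misc/gemma/gemma-3-12B-it-Q6_KLA.gguf \
      -f prompt.txt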
