CUDA error: an illegal memory access was encountered (with large prompts) #13851
Closed
@ko-alex

Name and Version

./build/bin/llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: yes
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
version: 5518 (26b79b6)
built with cc (Debian 12.2.0-14+deb12u1) 12.2.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-cli

Command line

llama-cli -t 4 --flash-attn --color --conversation --multiline-input --mirostat 2 --tensor-split 1.2,0.8 --ctx-size $((8192*20)) --n-gpu-layers 66 --temp 0.9 -m /work/models/misc/gemma/gemma-3-12B-it-Q6_KLA.gguf

Problem description & steps to reproduce

A large prompt (above roughly 16k tokens) consistently fails with the error below:

  • paste a prompt longer than ~16k tokens (a scripted reproduction is sketched after this list)
  • the process crashes with the CUDA error shown in the log output below
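
A non-interactive reproduction along these lines should trigger the same crash. This is only a sketch: it assumes llama-cli's -f/--file flag for reading the prompt from a file, and the filler text and repetition count are illustrative, chosen just to push the prompt well past 16k tokens.

  # build a throwaway prompt well above 16k tokens
  yes "The quick brown fox jumps over the lazy dog." | head -n 2500 > prompt.txt
  # same flags as the command line above, minus the interactive ones
  ./build/bin/llama-cli -t 4 --flash-attn --mirostat 2 --tensor-split 1.2,0.8 \
      --ctx-size $((8192*20)) --n-gpu-layers 66 --temp 0.9 \
      -m /work/models/misc/gemma/gemma-3-12B-it-Q6_KLA.gguf \
      -f prompt.txt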

First Bad Commit

952f395
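
To double-check the regression boundary, a checkout-and-rebuild sketch like the following can be used, assuming a plain CMake CUDA build (-DGGML_CUDA=ON); rerun the reproduction above after each build:

  # parent of the suspected commit: expected to handle >16k prompts
  git checkout 952f395~1
  cmake -B build -DGGML_CUDA=ON && cmake --build build -j
  # suspected first bad commit: expected to crash
  git checkout 952f395
  cmake -B build -DGGML_CUDA=ON && cmake --build build -j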

Relevant log output

work/src/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:75: CUDA error
CUDA error: an illegal memory access was encountered
  current device: 1, in function ggml_backend_cuda_synchronize at /work/src/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2461
  cudaStreamSynchronize(cuda_ctx->stream())
[New LWP 69906]
[New LWP 69913]
[New LWP 69914]
[New LWP 69915]
[New LWP 69916]
[New LWP 69917]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fe36ccf2c17 in wait4 () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x00007fe36ccf2c17 in wait4 () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fe36d1b81c5 in ggml_abort () from /work/src/llama.cpp/build/bin/libggml-base.so
#2  0x00007fe36a4b75b3 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /work/src/llama.cpp/build/bin/libggml-cuda.so
#3  0x00007fe36a4b8aea in ggml_backend_cuda_synchronize(ggml_backend*) () from /work/src/llama.cpp/build/bin/libggml-cuda.so
#4  0x00007fe36d1c61c6 in ggml_backend_sched_synchronize () from /work/src/llama.cpp/build/bin/libggml-base.so
#5  0x00007fe36d2f20c0 in llama_context::synchronize() () from /work/src/llama.cpp/build/bin/libllama.so
#6  0x00007fe36d2f3b50 in llama_get_logits_ith () from /work/src/llama.cpp/build/bin/libllama.so
#7  0x000055adf0f2aaa3 in common_sampler_sample(common_sampler*, llama_context*, int, bool) ()
#8  0x000055adf0e1dcaa in main ()

compute-sanitizer reports a few thousand errors like the following:

========= Invalid __global__ read of size 4 bytes
=========     at void k_get_rows_float<float, float>(const T1 *, const int *, T2 *, long, long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long)+0x490
=========     by thread (32,0,0) in block (0,0,0)
=========     Address 0x9dc06ca7e680 is out of bounds
=========     and is 32,958,472,250,497 bytes after the nearest allocation at 0x7fc6aea00000 of size 512 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame:  [0x381497] in libcuda.so.1
=========         Host Frame:  [0x13e88] in libcudart.so.12
=========         Host Frame: cudaLaunchKernel [0x79f87] in libcudart.so.12
=========         Host Frame: get_rows_cuda(void const*, ggml_type, int const*, void*, ggml_type, long, unsigned long, unsigned long, unsigned long, long, long, long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, CUstream_st*) [0xb3224] in libggml-cuda.so
=========         Host Frame: ggml_cuda_op_get_rows(ggml_backend_cuda_context&, ggml_tensor*) [0xb65b0] in libggml-cuda.so
=========         Host Frame: ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0xc398a] in libggml-cuda.so
=========         Host Frame: ggml_backend_sched_graph_compute_async [0x27c92] in libggml-base.so
=========         Host Frame: llama_context::graph_compute(ggml_cgraph*, bool) [0x72a28] in libllama.so
=========         Host Frame: llama_context::decode(llama_batch&) [0x77175] in libllama.so
=========         Host Frame: llama_decode [0x78522] in libllama.so
=========         Host Frame: main [0xbf913] in llama-cli
=========
========= Invalid __global__ read of size 4 bytes
=========     at void k_get_rows_float<float, float>(const T1 *, const int *, T2 *, long, long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long)+0x490
=========     by thread (33,0,0) in block (0,0,0)
=========     Address 0x9dc06ca7e684 is out of bounds
=========     and is 32,958,472,250,501 bytes after the nearest allocation at 0x7fc6aea00000 of size 512 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame:  [0x381497] in libcuda.so.1
=========         Host Frame:  [0x13e88] in libcudart.so.12
=========         Host Frame: cudaLaunchKernel [0x79f87] in libcudart.so.12
=========         Host Frame: get_rows_cuda(void const*, ggml_type, int const*, void*, ggml_type, long, unsigned long, unsigned long, unsigned long, long, long, long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, CUstream_st*) [0xb3224] in libggml-cuda.so
=========         Host Frame: ggml_cuda_op_get_rows(ggml_backend_cuda_context&, ggml_tensor*) [0xb65b0] in libggml-cuda.so
=========         Host Frame: ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0xc398a] in libggml-cuda.so
=========         Host Frame: ggml_backend_sched_graph_compute_async [0x27c92] in libggml-base.so
=========         Host Frame: llama_context::graph_compute(ggml_cgraph*, bool) [0x72a28] in libllama.so
=========         Host Frame: llama_context::decode(llama_batch&) [0x77175] in libllama.so
=========         Host Frame: llama_decode [0x78522] in libllama.so
=========         Host Frame: main [0xbf913] in llama-cli
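
For reference, output like the above can be collected with an invocation roughly like this (a sketch; memcheck is compute-sanitizer's default tool, and the remaining flags mirror the command line section):

  compute-sanitizer --tool memcheck \
      ./build/bin/llama-cli -t 4 --flash-attn --tensor-split 1.2,0.8 \
      --ctx-size $((8192*20)) --n-gpu-layers 66 \
      -m /work/models/misc/gemma/gemma-3-12B-it-Q6_KLA.gguf \
      -f prompt.txt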
