Prerequisites
Please answer the following questions for yourself before submitting an issue.
- I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behavior
Expected tensor split to leverage both GPUs for inference.
Current Behavior
Segfault after model loading when using multi-GPU tensor split. Inference works correctly on either GPU (two Vega 56 cards installed) when HIP_VISIBLE_DEVICES is used to force single-GPU inference (see the command sketch below).
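For reference, a rough sketch of the two kinds of invocations (the model path, prompt, and layer count are placeholders, not the exact command line used):
# Default multi-GPU run (tensor split across both devices) -- segfaults after model loading:
$ ./main -m models/llama-13b.Q4_0.gguf -ngl 99 -ts 1,1 -p "Hello"
# Forcing a single GPU -- inference works on either card:
$ HIP_VISIBLE_DEVICES=0 ./main -m models/llama-13b.Q4_0.gguf -ngl 99 -p "Hello"
$ HIP_VISIBLE_DEVICES=1 ./main -m models/llama-13b.Q4_0.gguf -ngl 99 -p "Hello"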
Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.
- Physical (or virtual) hardware you are using, e.g. for Linux:
AMD Ryzen 7 1700X
2× Radeon RX Vega 56 (8 GB each)
- Operating System (Ubuntu 22.04 LTS):
Linux jerryxu-Inspiron-5675 6.2.0-33-generic #33~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Sep 7 10:33:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
- SDK version, e.g. for Linux:
$ python3 --version
Python 3.10.13
$ make --version
GNU Make 4.3
$ g++ --version
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Failure Information (for bugs)
See the failure logs and GDB stack trace below.
Steps to Reproduce
- Compile llama.cpp with ROCm support
- Run any model with a tensor split across both GPUs (tried two quantizations each of 7B and 13B); a command sketch follows this list
- Observe the segfault after model loading
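Roughly the commands used (a sketch: the model file name is a placeholder, and the build flag follows the ROCm instructions in README.md):
# Build with ROCm/hipBLAS support:
$ make clean && make LLAMA_HIPBLAS=1
# Run with the model split across the two Vega 56 cards:
$ ./main -m models/llama-13b.Q4_0.gguf -ngl 99 -ts 1,1 -p "Hello"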
Failure Logs
llama.cpp log:
Log start
main: build = 1310 (1c84003)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1696299120
ggml_init_cublas: found 2 ROCm devices:
Device 0: Radeon RX Vega, compute capability 9.0
Device 1: Radeon RX Vega, compute capability 9.0
...................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 256.00 MB
llama_new_context_with_model: kv self size = 256.00 MB
llama_new_context_with_model: compute buffer total size = 76.38 MB
llama_new_context_with_model: VRAM scratch buffer: 70.50 MB
llama_new_context_with_model: total VRAM used: 4801.43 MB (model: 4474.93 MB, context: 326.50 MB)
Segmentation fault (core dumped)
GDB stacktrace on segfault:
#0 0x00007ffff672582e in ?? () from /opt/rocm/lib/libamdhip64.so.5
#1 0x00007ffff672dba0 in ?? () from /opt/rocm/lib/libamdhip64.so.5
#2 0x00007ffff672fc6d in ?? () from /opt/rocm/lib/libamdhip64.so.5
#3 0x00007ffff66f8a44 in ?? () from /opt/rocm/lib/libamdhip64.so.5
#4 0x00007ffff65688e7 in ?? () from /opt/rocm/lib/libamdhip64.so.5
#5 0x00007ffff65689e5 in ?? () from /opt/rocm/lib/libamdhip64.so.5
#6 0x00007ffff6568ae0 in ?? () from /opt/rocm/lib/libamdhip64.so.5
#7 0x00007ffff65ac7a2 in hipMemcpy2DAsync () from /opt/rocm/lib/libamdhip64.so.5
#8 0x00005555556917e6 in ggml_cuda_op_mul_mat (src0=0x7ffd240e06b0, src1=0x7f8ab9ea0860, dst=0x7f8ab9ea09b0,
op=0x55555569f330 <ggml_cuda_op_mul_mat_q(ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, ihipStream_t* const&)>, convert_src1_to_q8_1=true)
at ggml-cuda.cu:6706
#9 0x000055555568cc45 in ggml_cuda_mul_mat (src0=0x7ffd240e06b0, src1=0x7f8ab9ea0860, dst=0x7f8ab9ea09b0) at ggml-cuda.cu:6895
#10 0x000055555568c754 in ggml_cuda_compute_forward (params=0x7ffffffebbb0, tensor=0x7f8ab9ea09b0) at ggml-cuda.cu:7388
#11 0x00005555555b4d1d in ggml_compute_forward (params=0x7ffffffebbb0, tensor=0x7f8ab9ea09b0) at ggml.c:16214
#12 0x00005555555b9a94 in ggml_graph_compute_thread (data=0x7ffffffebc00) at ggml.c:17911
#13 0x00005555555bb123 in ggml_graph_compute (cgraph=0x7f8ab9e00020, cplan=0x7ffffffebd00) at ggml.c:18440
#14 0x00005555555c72aa in ggml_graph_compute_helper (buf=std::vector of length 25112, capacity 25112 = {...}, graph=0x7f8ab9e00020, n_threads=1) at llama.cpp:478
#15 0x00005555555da79f in llama_decode_internal (lctx=..., batch=...) at llama.cpp:4144
#16 0x00005555555e6d41 in llama_decode (ctx=0x5555628ba020, batch=...) at llama.cpp:7454
#17 0x0000555555665dcf in llama_init_from_gpt_params (params=...) at common/common.cpp:845
#18 0x0000555555567b32 in main (argc=8, argv=0x7fffffffde08) at examples/main/main.cpp:181
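The backtrace above was captured roughly like this (a sketch, using the same placeholder model path as above); run reproduces the segfault and bt prints the trace:
$ gdb --args ./main -m models/llama-13b.Q4_0.gguf -ngl 99 -ts 1,1 -p "Hello"
(gdb) run
(gdb) bt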