docs/backend/SYCL.md (51 additions, 34 deletions)
@@ -17,25 +17,25 @@
**SYCL** is a high-level parallel programming model designed to improve developers' productivity when writing code across various hardware accelerators such as CPUs, GPUs, and FPGAs. It is a single-source language designed for heterogeneous computing and based on standard C++17.
- **oneAPI** is an open ecosystem and a standard-based specification, supporting multiple architectures including but not limited to intel CPUs, GPUs and FPGAs. The key components of the oneAPI ecosystem include:
+ **oneAPI** is an open ecosystem and a standard-based specification, supporting multiple architectures including but not limited to Intel CPUs, GPUs and FPGAs. The key components of the oneAPI ecosystem include:
- **DPCPP** *(Data Parallel C++)*: The primary oneAPI SYCL implementation, which includes the icpx/icx Compilers.
- **oneAPI Libraries**: A set of highly optimized libraries targeting multiple domains *(e.g. Intel oneMKL, oneMath and oneDNN)*.
- - **oneAPI LevelZero**: A high performance low level interface for fine-grained control over intel iGPUs and dGPUs.
+ - **oneAPI LevelZero**: A high performance low level interface for fine-grained control over Intel iGPUs and dGPUs.
- **Nvidia & AMD Plugins**: These are plugins extending oneAPI's DPCPP support to SYCL on Nvidia and AMD GPU targets.
### Llama.cpp + SYCL
- The llama.cpp SYCL backend is designed to support **Intel GPU** firstly. Based on the cross-platform feature of SYCL, it also supports other vendor GPUs: Nvidia and AMD.
+ The llama.cpp SYCL backend is primarily designed for **Intel GPUs**.
+ SYCL cross-platform capabilities enable support for Nvidia GPUs as well, with limited support for AMD.
## Recommended Release
- The SYCL backend would be broken by some PRs due to no online CI.
- The following release is verified with good quality:
+ The following releases are verified and recommended:
| Intel Data Center Max Series | Support | Max 1550, 1100 |
| Intel Data Center Flex Series | Support | Flex 170 |
- | Intel Arc Series | Support | Arc 770, 730M, Arc A750 |
- | Intel built-in Arc GPU | Support | built-in Arc GPU in Meteor Lake, Arrow Lake |
- | Intel iGPU | Support | iGPU in 13700k, iGPU in 13400, i5-1250P, i7-1260P, i7-1165G7 |
+ | Intel Arc Series | Support | Arc 770, 730M, Arc A750, B580 |
+ | Intel built-in Arc GPU | Support | built-in Arc GPU in Meteor Lake, Arrow Lake, Lunar Lake |
+ | Intel iGPU | Support | iGPU in 13700k, 13400, i5-1250P, i7-1260P, i7-1165G7 |
*Notes:*
- **Memory**
- The device memory is a limitation when running a large model. The loaded model size, *`llm_load_tensors: buffer_size`*, is displayed in the log when running `./bin/llama-cli`.
- Please make sure the GPU shared memory from the host is large enough to account for the model's size. For example, *llama-2-7b.Q4_0* requires at least 8.0GB for an integrated GPU and 4.0GB for a discrete GPU.
- **Execution Unit (EU)**
@@ -138,19 +137,22 @@ Note: AMD GPU support is highly experimental and is incompatible with F16.
Additionally, it only supports GPUs with a sub_group_size (warp size) of 32.
## Docker
- The docker build option is currently limited to *intel GPU* targets.
+ The docker build option is currently limited to *Intel GPU* targets.
- To build in default FP32 *(Slower than FP16 alternative)*, you can remove the `--build-arg="GGML_SYCL_F16=ON"` argument from the previous command.
+ To build in default FP32 *(Slower than FP16 alternative)*, set `--build-arg="GGML_SYCL_F16=OFF"` in the previous command.
You can also use the `.devops/llama-server-intel.Dockerfile`, which builds the *"server"* alternative.
+ Check the [documentation for Docker](../docker.md) to see the available images.
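For illustration, a minimal sketch of a build command using these options, assuming the server Dockerfile mentioned above and a hypothetical image tag `llama-cpp-sycl-server` (adjust to the exact build command used earlier in this guide):

```sh
# Sketch: build the SYCL server image with FP16 enabled; the image tag is
# arbitrary. Use --build-arg="GGML_SYCL_F16=OFF" for the default FP32 path.
docker build -t llama-cpp-sycl-server \
  --build-arg="GGML_SYCL_F16=ON" \
  -f .devops/llama-server-intel.Dockerfile .
```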
### Run container
@@ -250,7 +252,7 @@ sycl-ls
- **Intel GPU**
- When targeting an intel GPU, the user should expect one or more level-zero devices among the available SYCL devices. Please make sure that at least one GPU is present, for instance [`level_zero:gpu`] in the sample output below:
+ When targeting an Intel GPU, the user should expect one or more devices among the available SYCL devices. Please make sure that at least one GPU is present via `sycl-ls`, for instance `[level_zero:gpu]` in the sample output below:
- You can refer to the general [*Prepare and Quantize*](README.md#prepare-and-quantize) guide for model prepration, or simply download [llama-2-7b.Q4_0.gguf](https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_0.gguf) model as example.
+ You can refer to the general [*Prepare and Quantize*](README.md#prepare-and-quantize) guide for model preparation, or download an already quantized model like [llama-2-7b.Q4_0.gguf](https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_0.gguf) or [Meta-Llama-3-8B-Instruct-Q4_0.gguf](https://huggingface.co/aptha/Meta-Llama-3-8B-Instruct-Q4_0-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_0.gguf).
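For example, assuming `wget` is available and a local `models/` directory, the second model above can be fetched directly via its resolve URL:

```sh
# Illustrative download of the quantized GGUF linked above into models/
mkdir -p models
wget -O models/Meta-Llama-3-8B-Instruct-Q4_0.gguf \
  https://huggingface.co/aptha/Meta-Llama-3-8B-Instruct-Q4_0-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_0.gguf
```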
##### Check device
@@ -398,11 +400,15 @@ Choose one of following methods to run.
```sh
./examples/sycl/run-llama2.sh 0
+# OR
+./examples/sycl/run-llama3.sh 0
```
- Use multiple devices:
```sh
./examples/sycl/run-llama2.sh
+# OR
+./examples/sycl/run-llama3.sh
```
2. Command line
@@ -425,13 +431,13 @@ Examples:
- Use device 0:
```sh
-ZES_ENABLE_SYSMAN=1 ./build/bin/llama-cli -no-cnv -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm none -mg 0
+ZES_ENABLE_SYSMAN=1 ./build/bin/llama-cli -no-cnv -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 99 -sm none -mg 0
```
- Use multiple devices:
```sh
-ZES_ENABLE_SYSMAN=1 ./build/bin/llama-cli -no-cnv -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm layer
+ZES_ENABLE_SYSMAN=1 ./build/bin/llama-cli -no-cnv -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 99 -sm layer
```
*Notes:*
@@ -452,7 +458,7 @@ use 1 SYCL GPUs: [0] with Max compute units:512
1. Install GPU driver
- Intel GPU drivers instructions guide and download page can be found here: [Get intel GPU Drivers](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/software/drivers.html).
+ Intel GPU drivers instructions guide and download page can be found here: [Get Intel GPU Drivers](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/software/drivers.html).
2. Install Visual Studio
@@ -629,7 +635,7 @@ Once it is completed, final results will be in **build/Release/bin**
#### Retrieve and prepare model
- You can refer to the general [*Prepare and Quantize*](README.md#prepare-and-quantize) guide for model prepration, or simply download [llama-2-7b.Q4_0.gguf](https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_0.gguf) model as example.
+ You can refer to the general [*Prepare and Quantize*](README.md#prepare-and-quantize) guide for model preparation, or download an already quantized model like [llama-2-7b.Q4_0.gguf](https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_0.gguf) or [Meta-Llama-3-8B-Instruct-Q4_0.gguf](https://huggingface.co/aptha/Meta-Llama-3-8B-Instruct-Q4_0-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_0.gguf).
##### Check device
@@ -648,7 +654,7 @@ Similar to the native `sycl-ls`, available SYCL devices can be queried as follow
build\bin\llama-ls-sycl-device.exe
```
- This command will only display the selected backend that is supported by SYCL. The default backend is level_zero. For example, in a system with 2 *intel GPU* it would look like the following:
+ This command will only display the selected backend that is supported by SYCL. The default backend is level_zero. For example, in a system with 2 *Intel GPUs* it would look like the following:
- | GGML_SYCL | ON (mandatory) | Enable build with SYCL code path.<br>FP32 path - recommended for better perforemance than FP16 on quantized model |
+ | GGML_SYCL | ON (mandatory) | Enable build with SYCL code path. |
| GGML_SYCL_TARGET | INTEL *(default)*\| NVIDIA \| AMD | Set the SYCL target device type. |
| GGML_SYCL_DEVICE_ARCH | Optional (except for AMD) | Set the SYCL device architecture, optional except for AMD. Setting the device architecture can improve the performance. See the table [--offload-arch](https://github.com/intel/llvm/blob/sycl/sycl/doc/design/OffloadDesign.md#--offload-arch) for a list of valid architectures. |
- | GGML_SYCL_F16 | OFF *(default)*\|ON *(optional)*| Enable FP16 build with SYCL code path. |
+ | GGML_SYCL_F16 | OFF *(default)*\|ON *(optional)*| Enable FP16 build with SYCL code path. (1.)|
| GGML_SYCL_GRAPH | ON *(default)*\|OFF *(Optional)*| Enable build with [SYCL Graph extension](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc). |
| GGML_SYCL_DNN | ON *(default)*\|OFF *(Optional)*| Enable build with oneDNN. |
| CMAKE_C_COMPILER |`icx`*(Linux)*, `icx/cl`*(Windows)*| Set `icx` compiler for SYCL code path. |
| CMAKE_CXX_COMPILER |`icpx`*(Linux)*, `icx`*(Windows)*| Set `icpx/icx` compiler for SYCL code path. |
+ 1. FP16 is recommended for better prompt processing performance on quantized models. Performance is equivalent in text generation, but set `GGML_SYCL_F16=OFF` if you are experiencing issues with FP16 builds.
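As a reference sketch of how these options fit together (assuming the oneAPI environment has been sourced and the Linux compilers from the table; adapt to the exact build command in this guide):

```sh
# Sketch: configure and build the SYCL backend for the default Intel target
# with FP16 enabled; drop -DGGML_SYCL_F16=ON to stay on the FP32 path.
source /opt/intel/oneapi/setvars.sh
cmake -B build -DGGML_SYCL=ON \
      -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx \
      -DGGML_SYCL_F16=ON
cmake --build build --config Release -j
```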
#### Runtime
| Name | Value | Function |
@@ -752,7 +769,7 @@ use 1 SYCL GPUs: [0] with Max compute units:512
## Q&A
- - Error: `error while loading shared libraries: libsycl.so.7: cannot open shared object file: No such file or directory`.
+ - Error: `error while loading shared libraries: libsycl.so: cannot open shared object file: No such file or directory`.
- Potential cause: Unavailable oneAPI installation or unset ENV variables.
- Solution: Install *oneAPI base toolkit* and enable its ENV through: `source /opt/intel/oneapi/setvars.sh`.
@@ -781,18 +798,18 @@ use 1 SYCL GPUs: [0] with Max compute units:512
It's the same for other projects, including the llama.cpp SYCL backend.
- - Meet issue: `Native API failed. Native API returns: -6 (PI_ERROR_OUT_OF_HOST_MEMORY) -6 (PI_ERROR_OUT_OF_HOST_MEMORY) -999 (UNKNOWN PI error)` or `failed to allocate SYCL0 buffer`
+ - `Native API failed. Native API returns: 39 (UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY)`, `ggml_backend_sycl_buffer_type_alloc_buffer: can't allocate 3503030272 Bytes of memory on device`, or `failed to allocate SYCL0 buffer`
- Device Memory is not enough.
+ You are running out of Device Memory.
|Reason|Solution|
|-|-|
- |Default Context is too big. It leads to more memory usage.|Set `-c 8192` or smaller value.|
- |Model is big and require more memory than device's.|Choose smaller quantized model, like Q5 -> Q4;<br>Use more than one devices to load model.|
+ | The default context is too big. It leads to excessive memory usage.|Set `-c 8192` or a smaller value.|
+ | The model is too big and requires more memory than what is available.|Choose a smaller model or change to a smaller quantization, like Q5 -> Q4;<br>Alternatively, use more than one device to load model.|
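Purely as an illustration of combining both mitigations (a smaller context plus splitting layers across devices), assuming the model path from the earlier examples:

```sh
# Sketch: reduce the context window and split layers across all SYCL devices
# to lower per-device memory pressure; the values are examples only.
ZES_ENABLE_SYSMAN=1 ./build/bin/llama-cli -no-cnv \
  -m models/llama-2-7b.Q4_0.gguf \
  -p "Building a website can be done in 10 simple steps:" \
  -n 400 -e -ngl 99 -c 4096 -sm layer
```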
### **GitHub contribution**:
- Please add the **[SYCL]** prefix/tag in issues/PRs titles to help the SYCL-team check/address them without delay.
+ Please add the `SYCL :` prefix/tag in issues/PRs titles to help the SYCL contributors check/address them without delay.
docs/docker.md (3 additions, 0 deletions)
@@ -22,6 +22,9 @@ Additionally, there the following images, similar to the above:
- `ghcr.io/ggml-org/llama.cpp:full-musa`: Same as `full` but compiled with MUSA support. (platforms: `linux/amd64`)
- `ghcr.io/ggml-org/llama.cpp:light-musa`: Same as `light` but compiled with MUSA support. (platforms: `linux/amd64`)
- `ghcr.io/ggml-org/llama.cpp:server-musa`: Same as `server` but compiled with MUSA support. (platforms: `linux/amd64`)
+ - `ghcr.io/ggml-org/llama.cpp:full-intel`: Same as `full` but compiled with SYCL support. (platforms: `linux/amd64`)
+ - `ghcr.io/ggml-org/llama.cpp:light-intel`: Same as `light` but compiled with SYCL support. (platforms: `linux/amd64`)
+ - `ghcr.io/ggml-org/llama.cpp:server-intel`: Same as `server` but compiled with SYCL support. (platforms: `linux/amd64`)
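As a purely illustrative sketch of how one of the SYCL images might be launched (the device passthrough flags, mounts, port, and model name are assumptions and depend on your system; Intel GPUs are typically exposed to the container via `/dev/dri`):

```sh
# Hypothetical invocation: run the SYCL server image with the host's Intel GPU
# exposed to the container; adjust mounts, ports, and the model file as needed.
docker run -it --rm \
  --device /dev/dri \
  -v "$(pwd)/models:/models" \
  -p 8080:8080 \
  ghcr.io/ggml-org/llama.cpp:server-intel \
  -m /models/llama-2-7b.Q4_0.gguf --host 0.0.0.0 --port 8080
```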
The GPU enabled images are not currently tested by CI beyond being built. They are not built with any variation from the ones in the Dockerfiles defined in [.devops/](../.devops/) and the GitHub Action defined in [.github/workflows/docker.yml](../.github/workflows/docker.yml). If you need different settings (for example, a different CUDA, ROCm or MUSA library), you'll need to build the images locally for now.