Merge branch 'main' into integrate-functionary · notwa/llama-cpp-python@7ea2e6e · GitHub

Commit 7ea2e6e

Merge branch 'main' into integrate-functionary
2 parents 9594d5c + da003d8 commit 7ea2e6e

File tree

13 files changed (+288, -46 lines)


CHANGELOG.md

Lines changed: 22 additions & 0 deletions
@@ -7,6 +7,28 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

+## [0.2.36]
+
+- feat: Update llama.cpp to ggerganov/llama.cpp@2aed77eb06a329f0d82bb1c467f4244904d4073f
+- feat: Add mistral instruct chat format as "mistral-instruct" by @Rafaelblsilva in #799
+
+## [0.2.35]
+
+- feat: Update llama.cpp to ggerganov/llama.cpp@d2f650cb5b04ee2726663e79b47da5efe196ce00
+
+## [0.2.34]
+
+- feat: Update llama.cpp to ggerganov/llama.cpp@6db2b41a76ee78d5efdd5c3cddd5d7ad3f646855
+- feat: Add json schema mode by @abetlen in #1122
+
+## [0.2.33]
+
+- feat: Update llama.cpp to ggerganov/llama.cpp@faa3526a1eba458120987ed8269e5616385a76f4
+- feat(server): include llama-cpp-python version in openapi spec by @abetlen in cde7514c3d28e6d52f272614e9957208c344dde5
+- fix: use both eos and bos tokens as stop sequences for hf-tokenizer-config chat format. by @abetlen in 5b982d0f8c6f35242c8862ffdce00e17cea0b44f
+- fix: GGUF metadata KV overrides, re #1011 by @phiharri in #1116
+- fix: llama_log_set should be able to accept null pointer by @abetlen in c970d41a85381fd55235136f123422df0bf0c7e7
+
## [0.2.32]

- feat: Update llama.cpp to ggerganov/llama.cpp@504dc37be8446fb09b1ede70300250ad41be32a2

Makefile

Lines changed: 3 additions & 0 deletions
@@ -27,6 +27,9 @@ build.blis:
build.metal:
	CMAKE_ARGS="-DLLAMA_METAL=on" python3 -m pip install --verbose -e .

+build.vulkan:
+	CMAKE_ARGS="-DLLAMA_VULKAN=on" python3 -m pip install --verbose -e .
+
build.sdist:
	python3 -m build --sdist

README.md

Lines changed: 82 additions & 7 deletions
@@ -71,7 +71,7 @@ CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-

#### cuBLAS

-To install with cuBLAS, set the `LLAMA_CUBLAS=1` environment variable before installing:
+To install with cuBLAS, set the `LLAMA_CUBLAS=on` environment variable before installing:

```bash
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
@@ -87,7 +87,7 @@ CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python

#### CLBlast

-To install with CLBlast, set the `LLAMA_CLBLAST=1` environment variable before installing:
+To install with CLBlast, set the `LLAMA_CLBLAST=on` environment variable before installing:

```bash
CMAKE_ARGS="-DLLAMA_CLBLAST=on" pip install llama-cpp-python
@@ -101,9 +101,18 @@ To install with hipBLAS / ROCm support for AMD cards, set the `LLAMA_HIPBLAS=on`
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
```

+#### Vulkan
+
+To install with Vulkan support, set the `LLAMA_VULKAN=on` environment variable before installing:
+
+```bash
+CMAKE_ARGS="-DLLAMA_VULKAN=on" pip install llama-cpp-python
+```
+
### Windows Notes

If you run into issues where it complains it can't find `'nmake'` `'?'` or CMAKE_C_COMPILER, you can extract w64devkit as [mentioned in llama.cpp repo](https://github.com/ggerganov/llama.cpp#openblas) and add those manually to CMAKE_ARGS before running `pip` install:
+
```ps
$env:CMAKE_GENERATOR = "MinGW Makefiles"
$env:CMAKE_ARGS = "-DLLAMA_OPENBLAS=on -DCMAKE_C_COMPILER=C:/w64devkit/bin/gcc.exe -DCMAKE_CXX_COMPILER=C:/w64devkit/bin/g++.exe"
@@ -118,17 +127,19 @@ Detailed MacOS Metal GPU install documentation is available at [docs/install/mac
#### M1 Mac Performance Issue

Note: If you are using Apple Silicon (M1) Mac, make sure you have installed a version of Python that supports arm64 architecture. For example:
-```
+
+```bash
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
bash Miniforge3-MacOSX-arm64.sh
```
+
Otherwise, while installing it will build the llama.cpp x86 version which will be 10x slower on Apple Silicon (M1) Mac.

#### M Series Mac Error: `(mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64'))`

Try installing with

-```
+```bash
CMAKE_ARGS="-DCMAKE_OSX_ARCHITECTURES=arm64 -DCMAKE_APPLE_SILICON_PROCESSOR=arm64 -DLLAMA_METAL=on" pip install --upgrade --verbose --force-reinstall --no-cache-dir llama-cpp-python
```

@@ -152,10 +163,15 @@ Below is a short example demonstrating how to use the high-level API to for basi

```python
>>> from llama_cpp import Llama
->>> llm = Llama(model_path="./models/7B/llama-model.gguf")
+>>> llm = Llama(
+      model_path="./models/7B/llama-model.gguf",
+      # n_gpu_layers=-1, # Uncomment to use GPU acceleration
+      # seed=1337, # Uncomment to set a specific seed
+      # n_ctx=2048, # Uncomment to increase the context window
+)
>>> output = llm(
      "Q: Name the planets in the solar system? A: ", # Prompt
-      max_tokens=32, # Generate up to 32 tokens
+      max_tokens=32, # Generate up to 32 tokens, set to None to generate up to the end of the context window
      stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
      echo=True # Echo the prompt back in the output
) # Generate a completion, can also call create_completion
@@ -191,7 +207,10 @@ Note that `chat_format` option must be set for the particular model you are usin

```python
>>> from llama_cpp import Llama
->>> llm = Llama(model_path="path/to/llama-2/llama-model.gguf", chat_format="llama-2")
+>>> llm = Llama(
+      model_path="path/to/llama-2/llama-model.gguf",
+      chat_format="llama-2"
+)
>>> llm.create_chat_completion(
      messages = [
          {"role": "system", "content": "You are an assistant who perfectly describes images."},
@@ -205,6 +224,59 @@ Note that `chat_format` option must be set for the particular model you are usin

Chat completion is available through the [`create_chat_completion`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion) method of the [`Llama`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama) class.

+### JSON and JSON Schema Mode
+
+If you want to constrain chat responses to only valid JSON or a specific JSON Schema you can use the `response_format` argument to the `create_chat_completion` method.
+
+#### JSON Mode
+
+The following example will constrain the response to be valid JSON.
+
+```python
+>>> from llama_cpp import Llama
+>>> llm = Llama(model_path="path/to/model.gguf", chat_format="chatml")
+>>> llm.create_chat_completion(
+      messages=[
+          {
+              "role": "system",
+              "content": "You are a helpful assistant that outputs in JSON.",
+          },
+          {"role": "user", "content": "Who won the world series in 2020"},
+      ],
+      response_format={
+          "type": "json_object",
+      },
+      temperature=0.7,
+)
+```
+
+#### JSON Schema Mode
+
+To constrain the response to a specific JSON Schema, you can use the `schema` property of the `response_format` argument.
+
+```python
+>>> from llama_cpp import Llama
+>>> llm = Llama(model_path="path/to/model.gguf", chat_format="chatml")
+>>> llm.create_chat_completion(
+      messages=[
+          {
+              "role": "system",
+              "content": "You are a helpful assistant that outputs in JSON.",
+          },
+          {"role": "user", "content": "Who won the world series in 2020"},
+      ],
+      response_format={
+          "type": "json_object",
+          "schema": {
+              "type": "object",
+              "properties": {"team_name": {"type": "string"}},
+              "required": ["team_name"],
+          },
+      },
+      temperature=0.7,
+)
+```
+
### Function Calling

The high-level API also provides a simple interface for function calling.
@@ -410,6 +482,9 @@ pip install -e .[all]
make clean
```

+You can also test out specific commits of `llama.cpp` by checking out the desired commit in the `vendor/llama.cpp` submodule and then running `make clean` and `pip install -e .` again. Any changes in the `llama.h` API will require
+changes to the `llama_cpp/llama_cpp.py` file to match the new API (additional changes may be required elsewhere).
+
# FAQ

### Are there pre-built binaries / binary wheels available?

llama_cpp/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
from .llama_cpp import *
from .llama import *

-__version__ = "0.2.32"
+__version__ = "0.2.36"

llama_cpp/_internals.py

Lines changed: 2 additions & 2 deletions
@@ -216,13 +216,13 @@ def metadata(self) -> Dict[str, str]:
        for i in range(llama_cpp.llama_model_meta_count(self.model)):
            nbytes = llama_cpp.llama_model_meta_key_by_index(self.model, i, buffer, buffer_size)
            if nbytes > buffer_size:
-                buffer_size = nbytes
+                buffer_size = nbytes + 1
                buffer = ctypes.create_string_buffer(buffer_size)
                nbytes = llama_cpp.llama_model_meta_key_by_index(self.model, i, buffer, buffer_size)
            key = buffer.value.decode("utf-8")
            nbytes = llama_cpp.llama_model_meta_val_str_by_index(self.model, i, buffer, buffer_size)
            if nbytes > buffer_size:
-                buffer_size = nbytes
+                buffer_size = nbytes + 1
                buffer = ctypes.create_string_buffer(buffer_size)
                nbytes = llama_cpp.llama_model_meta_val_str_by_index(self.model, i, buffer, buffer_size)
            value = buffer.value.decode("utf-8")
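For context on the `+ 1` fix above: when `ctypes.create_string_buffer()` is given an integer, it allocates exactly that many bytes, while the length reported by the metadata API appears not to include the C string's trailing NUL terminator, so a buffer resized to the bare length has no room for it. A minimal sketch of that sizing rule (the `nbytes` value below is illustrative, not taken from the diff):

```python
import ctypes

# Length reported for a metadata string, excluding the NUL terminator (illustrative value).
nbytes = 8

too_small = ctypes.create_string_buffer(nbytes)       # exactly 8 bytes: no room for the terminator
right_size = ctypes.create_string_buffer(nbytes + 1)  # 9 bytes: 8 characters plus the trailing NUL

print(ctypes.sizeof(too_small), ctypes.sizeof(right_size))  # -> 8 9
```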

llama_cpp/llama.py

Lines changed: 57 additions & 20 deletions
@@ -91,7 +91,7 @@ def __init__(
        # Backend Params
        numa: bool = False,
        # Chat Format Params
-        chat_format: str = "llama-2",
+        chat_format: Optional[str] = None,
        chat_handler: Optional[llama_chat_format.LlamaChatCompletionHandler] = None,
        # Misc
        verbose: bool = True,
@@ -199,32 +199,32 @@ def __init__(
        self.model_params.use_mmap = use_mmap if lora_path is None else False
        self.model_params.use_mlock = use_mlock

+        # kv_overrides is the original python dict
        self.kv_overrides = kv_overrides
        if kv_overrides is not None:
-            n_overrides = len(kv_overrides)
-            self._kv_overrides_array = llama_cpp.llama_model_kv_override * (n_overrides + 1)
-            self._kv_overrides_array_keys = []
-
-            for k, v in kv_overrides.items():
-                key_buf = ctypes.create_string_buffer(k.encode("utf-8"))
-                self._kv_overrides_array_keys.append(key_buf)
-                self._kv_overrides_array[i].key = key_buf
-                if isinstance(v, int):
+            # _kv_overrides_array is a ctypes.Array of llama_model_kv_override Structs
+            kvo_array_len = len(kv_overrides) + 1 # for sentinel element
+            self._kv_overrides_array = (
+                llama_cpp.llama_model_kv_override * kvo_array_len
+            )()
+
+            for i, (k, v) in enumerate(kv_overrides.items()):
+                self._kv_overrides_array[i].key = k.encode("utf-8")
+                if isinstance(v, bool):
+                    self._kv_overrides_array[i].tag = llama_cpp.LLAMA_KV_OVERRIDE_BOOL
+                    self._kv_overrides_array[i].value.bool_value = v
+                elif isinstance(v, int):
                    self._kv_overrides_array[i].tag = llama_cpp.LLAMA_KV_OVERRIDE_INT
                    self._kv_overrides_array[i].value.int_value = v
                elif isinstance(v, float):
                    self._kv_overrides_array[i].tag = llama_cpp.LLAMA_KV_OVERRIDE_FLOAT
                    self._kv_overrides_array[i].value.float_value = v
-                elif isinstance(v, bool):
-                    self._kv_overrides_array[i].tag = llama_cpp.LLAMA_KV_OVERRIDE_BOOL
-                    self._kv_overrides_array[i].value.bool_value = v
                else:
                    raise ValueError(f"Unknown value type for {k}: {v}")

-            self._kv_overrides_array_sentinel_key = b'\0'
-
-            # null array sentinel
-            self._kv_overrides_array[n_overrides].key = self._kv_overrides_array_sentinel_key
+            self._kv_overrides_array[
+                -1
+            ].key = b"\0" # ensure sentinel element is zeroed
            self.model_params.kv_overrides = self._kv_overrides_array

        self.n_batch = min(n_ctx, n_batch) # ???
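A hedged usage sketch of the rewritten override path (the key names below are illustrative, not taken from this diff): each entry of the `kv_overrides` dict becomes one `llama_model_kv_override` struct, and the extra zeroed element at the end of the ctypes array serves as the NULL-key sentinel that marks the end of the list. Note that `bool` is now checked before `int` because `bool` is a subclass of `int` in Python, so the old ordering would have tagged boolean overrides as integers.

```python
from llama_cpp import Llama

# Illustrative keys and values; real override keys depend on the GGUF metadata you want to replace.
llm = Llama(
    model_path="./models/7B/llama-model.gguf",
    kv_overrides={
        "example.int.key": 4096,    # tagged LLAMA_KV_OVERRIDE_INT
        "example.float.key": 1.0,   # tagged LLAMA_KV_OVERRIDE_FLOAT
        "example.bool.key": True,   # tagged LLAMA_KV_OVERRIDE_BOOL (checked before int)
    },
)
```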
@@ -342,18 +342,55 @@ def __init__(
            (n_ctx, self._n_vocab), dtype=np.single
        )

-        self._mirostat_mu = ctypes.c_float(2.0 * 5.0) # TODO: Move this to sampling context
+        self._mirostat_mu = ctypes.c_float(
+            2.0 * 5.0
+        ) # TODO: Move this to sampling context

        try:
            self.metadata = self._model.metadata()
        except Exception as e:
            self.metadata = {}
            if self.verbose:
                print(f"Failed to load metadata: {e}", file=sys.stderr)

        if self.verbose:
            print(f"Model metadata: {self.metadata}", file=sys.stderr)

+        if self.chat_format is None and self.chat_handler is None and "tokenizer.chat_template" in self.metadata:
+            chat_format = llama_chat_format.guess_chat_format_from_gguf_metadata(self.metadata)
+
+            if chat_format is not None:
+                self.chat_format = chat_format
+                if self.verbose:
+                    print(f"Guessed chat format: {chat_format}", file=sys.stderr)
+            else:
+                template = self.metadata["tokenizer.chat_template"]
+                try:
+                    eos_token_id = int(self.metadata["tokenizer.ggml.eos_token_id"])
+                except:
+                    eos_token_id = self.token_eos()
+                try:
+                    bos_token_id = int(self.metadata["tokenizer.ggml.bos_token_id"])
+                except:
+                    bos_token_id = self.token_bos()
+
+                eos_token = self.detokenize([eos_token_id]).decode("utf-8")
+                bos_token = self.detokenize([bos_token_id]).decode("utf-8")
+
+                if self.verbose:
+                    print(f"Using chat template: {template}", file=sys.stderr)
+                    print(f"Using chat eos_token: {eos_token}", file=sys.stderr)
+                    print(f"Using chat bos_token: {bos_token}", file=sys.stderr)
+
+                self.chat_handler = llama_chat_format.Jinja2ChatFormatter(
+                    template=template,
+                    eos_token=eos_token,
+                    bos_token=bos_token
+                ).to_chat_handler()
+
+        if self.chat_format is None and self.chat_handler is None:
+            self.chat_format = "llama-2"
+
    @property
    def ctx(self) -> llama_cpp.llama_context_p:
        assert self._ctx.ctx is not None
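A minimal sketch of the new default behavior, assuming a local GGUF model path: with `chat_format` left at its new default of `None`, the constructor inspects the GGUF metadata, tries to guess a known chat format, falls back to a `Jinja2ChatFormatter` built from the embedded `tokenizer.chat_template`, and only uses `"llama-2"` when neither is available. An explicit `chat_format` still takes precedence.

```python
from llama_cpp import Llama

# chat_format defaults to None, so the format is inferred from the model's GGUF metadata.
# With verbose=True this prints either "Guessed chat format: ..." or the template/eos/bos in use.
llm = Llama(model_path="./models/7B/llama-model.gguf", verbose=True)

# Passing chat_format explicitly overrides anything guessed from the metadata.
llm_chatml = Llama(model_path="./models/7B/llama-model.gguf", chat_format="chatml")
```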
@@ -550,7 +587,7 @@ def sample(
                candidates=self._candidates,
                tau=mirostat_tau,
                eta=mirostat_eta,
-                mu=ctypes.pointer(self._mirostat_mu)
+                mu=ctypes.pointer(self._mirostat_mu),
            )
        else:
            self._ctx.sample_top_k(candidates=self._candidates, k=top_k, min_keep=1)

0 commit comments