Merge branch 'main' into integrate-functionary · notwa/llama-cpp-python@7ea2e6e · GitHub

Commit 7ea2e6e

Merge branch 'main' into integrate-functionary
2 parents 9594d5c + da003d8 commit 7ea2e6e

File tree

13 files changed (+288, -46 lines)


CHANGELOG.md

Lines changed: 22 additions & 0 deletions
@@ -7,6 +7,28 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

+## [0.2.36]
+
+- feat: Update llama.cpp to ggerganov/llama.cpp@2aed77eb06a329f0d82bb1c467f4244904d4073f
+- feat: Add mistral instruct chat format as "mistral-instruct" by @Rafaelblsilva in #799
+
+## [0.2.35]
+
+- feat: Update llama.cpp to ggerganov/llama.cpp@d2f650cb5b04ee2726663e79b47da5efe196ce00
+
+## [0.2.34]
+
+- feat: Update llama.cpp to ggerganov/llama.cpp@6db2b41a76ee78d5efdd5c3cddd5d7ad3f646855
+- feat: Add json schema mode by @abetlen in #1122
+
+## [0.2.33]
+
+- feat: Update llama.cpp to ggerganov/llama.cpp@faa3526a1eba458120987ed8269e5616385a76f4
+- feat(server): include llama-cpp-python version in openapi spec by @abetlen in cde7514c3d28e6d52f272614e9957208c344dde5
+- fix: use both eos and bos tokens as stop sequences for hf-tokenizer-config chat format. by @abetlen in 5b982d0f8c6f35242c8862ffdce00e17cea0b44f
+- fix: GGUF metadata KV overrides, re #1011 by @phiharri in #1116
+- fix: llama_log_set should be able to accept null pointer by @abetlen in c970d41a85381fd55235136f123422df0bf0c7e7
+
## [0.2.32]

- feat: Update llama.cpp to ggerganov/llama.cpp@504dc37be8446fb09b1ede70300250ad41be32a2

Makefile

Lines changed: 3 additions & 0 deletions
@@ -27,6 +27,9 @@ build.blis:
build.metal:
	CMAKE_ARGS="-DLLAMA_METAL=on" python3 -m pip install --verbose -e .

+build.vulkan:
+	CMAKE_ARGS="-DLLAMA_VULKAN=on" python3 -m pip install --verbose -e .
+
build.sdist:
	python3 -m build --sdist

README.md

Lines changed: 82 additions & 7 deletions
@@ -71,7 +71,7 @@ CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-

#### cuBLAS

-To install with cuBLAS, set the `LLAMA_CUBLAS=1` environment variable before installing:
+To install with cuBLAS, set the `LLAMA_CUBLAS=on` environment variable before installing:

```bash
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
@@ -87,7 +87,7 @@ CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python

#### CLBlast

-To install with CLBlast, set the `LLAMA_CLBLAST=1` environment variable before installing:
+To install with CLBlast, set the `LLAMA_CLBLAST=on` environment variable before installing:

```bash
CMAKE_ARGS="-DLLAMA_CLBLAST=on" pip install llama-cpp-python
@@ -101,9 +101,18 @@ To install with hipBLAS / ROCm support for AMD cards, set the `LLAMA_HIPBLAS=on`
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
```

+#### Vulkan
+
+To install with Vulkan support, set the `LLAMA_VULKAN=on` environment variable before installing:
+
+```bash
+CMAKE_ARGS="-DLLAMA_VULKAN=on" pip install llama-cpp-python
+```
+
### Windows Notes

If you run into issues where it complains it can't find `'nmake'` `'?'` or CMAKE_C_COMPILER, you can extract w64devkit as [mentioned in llama.cpp repo](https://github.com/ggerganov/llama.cpp#openblas) and add those manually to CMAKE_ARGS before running `pip` install:
+
```ps
$env:CMAKE_GENERATOR = "MinGW Makefiles"
$env:CMAKE_ARGS = "-DLLAMA_OPENBLAS=on -DCMAKE_C_COMPILER=C:/w64devkit/bin/gcc.exe -DCMAKE_CXX_COMPILER=C:/w64devkit/bin/g++.exe"
@@ -118,17 +127,19 @@ Detailed MacOS Metal GPU install documentation is available at [docs/install/mac
#### M1 Mac Performance Issue

Note: If you are using Apple Silicon (M1) Mac, make sure you have installed a version of Python that supports arm64 architecture. For example:
-```
+
+```bash
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
bash Miniforge3-MacOSX-arm64.sh
```
+
Otherwise, while installing it will build the llama.cpp x86 version which will be 10x slower on Apple Silicon (M1) Mac.

#### M Series Mac Error: `(mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64'))`

Try installing with

-```
+```bash
CMAKE_ARGS="-DCMAKE_OSX_ARCHITECTURES=arm64 -DCMAKE_APPLE_SILICON_PROCESSOR=arm64 -DLLAMA_METAL=on" pip install --upgrade --verbose --force-reinstall --no-cache-dir llama-cpp-python
```

@@ -152,10 +163,15 @@ Below is a short example demonstrating how to use the high-level API to for basi

```python
>>> from llama_cpp import Llama
->>> llm = Llama(model_path="./models/7B/llama-model.gguf")
+>>> llm = Llama(
+      model_path="./models/7B/llama-model.gguf",
+      # n_gpu_layers=-1, # Uncomment to use GPU acceleration
+      # seed=1337, # Uncomment to set a specific seed
+      # n_ctx=2048, # Uncomment to increase the context window
+)
>>> output = llm(
      "Q: Name the planets in the solar system? A: ", # Prompt
-      max_tokens=32, # Generate up to 32 tokens
+      max_tokens=32, # Generate up to 32 tokens, set to None to generate up to the end of the context window
      stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
      echo=True # Echo the prompt back in the output
) # Generate a completion, can also call create_completion
@@ -191,7 +207,10 @@ Note that `chat_format` option must be set for the particular model you are usin

```python
>>> from llama_cpp import Llama
->>> llm = Llama(model_path="path/to/llama-2/llama-model.gguf", chat_format="llama-2")
+>>> llm = Llama(
+      model_path="path/to/llama-2/llama-model.gguf",
+      chat_format="llama-2"
+)
>>> llm.create_chat_completion(
      messages = [
          {"role": "system", "content": "You are an assistant who perfectly describes images."},
@@ -205,6 +224,59 @@ Note that `chat_format` option must be set for the particular model you are usin

Chat completion is available through the [`create_chat_completion`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion) method of the [`Llama`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama) class.

+### JSON and JSON Schema Mode
+
+If you want to constrain chat responses to only valid JSON or a specific JSON Schema you can use the `response_format` argument to the `create_chat_completion` method.
+
+#### JSON Mode
+
+The following example will constrain the response to be valid JSON.
+
+```python
+>>> from llama_cpp import Llama
+>>> llm = Llama(model_path="path/to/model.gguf", chat_format="chatml")
+>>> llm.create_chat_completion(
+      messages=[
+          {
+              "role": "system",
+              "content": "You are a helpful assistant that outputs in JSON.",
+          },
+          {"role": "user", "content": "Who won the world series in 2020"},
+      ],
+      response_format={
+          "type": "json_object",
+      },
+      temperature=0.7,
+)
+```
+
+#### JSON Schema Mode
+
+To constrain the response to a specific JSON Schema, you can use the `schema` property of the `response_format` argument.
+
+```python
+>>> from llama_cpp import Llama
+>>> llm = Llama(model_path="path/to/model.gguf", chat_format="chatml")
+>>> llm.create_chat_completion(
+      messages=[
+          {
+              "role": "system",
+              "content": "You are a helpful assistant that outputs in JSON.",
+          },
+          {"role": "user", "content": "Who won the world series in 2020"},
+      ],
+      response_format={
+          "type": "json_object",
+          "schema": {
+              "type": "object",
+              "properties": {"team_name": {"type": "string"}},
+              "required": ["team_name"],
+          },
+      },
+      temperature=0.7,
+)
+```
+
### Function Calling

The high-level API also provides a simple interface for function calling.
@@ -410,6 +482,9 @@ pip install -e .[all]
make clean
```

+You can also test out specific commits of `llama.cpp` by checking out the desired commit in the `vendor/llama.cpp` submodule and then running `make clean` and `pip install -e .` again. Any changes in the `llama.h` API will require
+changes to the `llama_cpp/llama_cpp.py` file to match the new API (additional changes may be required elsewhere).
+
# FAQ

### Are there pre-built binaries / binary wheels available?

llama_cpp/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
from .llama_cpp import *
from .llama import *

-__version__ = "0.2.32"
+__version__ = "0.2.36"

llama_cpp/_internals.py

Lines changed: 2 additions & 2 deletions
@@ -216,13 +216,13 @@ def metadata(self) -> Dict[str, str]:
        for i in range(llama_cpp.llama_model_meta_count(self.model)):
            nbytes = llama_cpp.llama_model_meta_key_by_index(self.model, i, buffer, buffer_size)
            if nbytes > buffer_size:
-                buffer_size = nbytes
+                buffer_size = nbytes + 1
                buffer = ctypes.create_string_buffer(buffer_size)
                nbytes = llama_cpp.llama_model_meta_key_by_index(self.model, i, buffer, buffer_size)
            key = buffer.value.decode("utf-8")
            nbytes = llama_cpp.llama_model_meta_val_str_by_index(self.model, i, buffer, buffer_size)
            if nbytes > buffer_size:
-                buffer_size = nbytes
+                buffer_size = nbytes + 1
                buffer = ctypes.create_string_buffer(buffer_size)
                nbytes = llama_cpp.llama_model_meta_val_str_by_index(self.model, i, buffer, buffer_size)
            value = buffer.value.decode("utf-8")
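For context on the `+ 1` fix above: when `ctypes.create_string_buffer()` is given an integer, it allocates exactly that many bytes, while the length reported by the metadata API appears not to include the C string's trailing NUL terminator, so a buffer resized to the bare length has no room for it. A minimal sketch of that sizing rule (the `nbytes` value below is illustrative, not taken from the diff):

```python
import ctypes

# Length reported for a metadata string, excluding the NUL terminator (illustrative value).
nbytes = 8

too_small = ctypes.create_string_buffer(nbytes)       # exactly 8 bytes: no room for the terminator
right_size = ctypes.create_string_buffer(nbytes + 1)  # 9 bytes: 8 characters plus the trailing NUL

print(ctypes.sizeof(too_small), ctypes.sizeof(right_size))  # -> 8 9
```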

llama_cpp/llama.py

Lines changed: 57 additions & 20 deletions
@@ -91,7 +91,7 @@ def __init__(
        # Backend Params
        numa: bool = False,
        # Chat Format Params
-        chat_format: str = "llama-2",
+        chat_format: Optional[str] = None,
        chat_handler: Optional[llama_chat_format.LlamaChatCompletionHandler] = None,
        # Misc
        verbose: bool = True,
@@ -199,32 +199,32 @@ def __init__(
        self.model_params.use_mmap = use_mmap if lora_path is None else False
        self.model_params.use_mlock = use_mlock

+        # kv_overrides is the original python dict
        self.kv_overrides = kv_overrides
        if kv_overrides is not None:
-            n_overrides = len(kv_overrides)
-            self._kv_overrides_array = llama_cpp.llama_model_kv_override * (n_overrides + 1)
-            self._kv_overrides_array_keys = []
-
-            for k, v in kv_overrides.items():
-                key_buf = ctypes.create_string_buffer(k.encode("utf-8"))
-                self._kv_overrides_array_keys.append(key_buf)
-                self._kv_overrides_array[i].key = key_buf
-                if isinstance(v, int):
+            # _kv_overrides_array is a ctypes.Array of llama_model_kv_override Structs
+            kvo_array_len = len(kv_overrides) + 1 # for sentinel element
+            self._kv_overrides_array = (
+                llama_cpp.llama_model_kv_override * kvo_array_len
+            )()
+
+            for i, (k, v) in enumerate(kv_overrides.items()):
+                self._kv_overrides_array[i].key = k.encode("utf-8")
+                if isinstance(v, bool):
+                    self._kv_overrides_array[i].tag = llama_cpp.LLAMA_KV_OVERRIDE_BOOL
+                    self._kv_overrides_array[i].value.bool_value = v
+                elif isinstance(v, int):
                    self._kv_overrides_array[i].tag = llama_cpp.LLAMA_KV_OVERRIDE_INT
                    self._kv_overrides_array[i].value.int_value = v
                elif isinstance(v, float):
                    self._kv_overrides_array[i].tag = llama_cpp.LLAMA_KV_OVERRIDE_FLOAT
                    self._kv_overrides_array[i].value.float_value = v
-                elif isinstance(v, bool):
-                    self._kv_overrides_array[i].tag = llama_cpp.LLAMA_KV_OVERRIDE_BOOL
-                    self._kv_overrides_array[i].value.bool_value = v
                else:
                    raise ValueError(f"Unknown value type for {k}: {v}")

-            self._kv_overrides_array_sentinel_key = b'\0'
-
-            # null array sentinel
-            self._kv_overrides_array[n_overrides].key = self._kv_overrides_array_sentinel_key
+            self._kv_overrides_array[
+                -1
+            ].key = b"\0" # ensure sentinel element is zeroed
            self.model_params.kv_overrides = self._kv_overrides_array

        self.n_batch = min(n_ctx, n_batch) # ???
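A hedged usage sketch of the rewritten override path (the key names below are illustrative, not taken from this diff): each entry of the `kv_overrides` dict becomes one `llama_model_kv_override` struct, and the extra zeroed element at the end of the ctypes array serves as the NULL-key sentinel that marks the end of the list. Note that `bool` is now checked before `int` because `bool` is a subclass of `int` in Python, so the old ordering would have tagged boolean overrides as integers.

```python
from llama_cpp import Llama

# Illustrative keys and values; real override keys depend on the GGUF metadata you want to replace.
llm = Llama(
    model_path="./models/7B/llama-model.gguf",
    kv_overrides={
        "example.int.key": 4096,    # tagged LLAMA_KV_OVERRIDE_INT
        "example.float.key": 1.0,   # tagged LLAMA_KV_OVERRIDE_FLOAT
        "example.bool.key": True,   # tagged LLAMA_KV_OVERRIDE_BOOL (checked before int)
    },
)
```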
@@ -342,18 +342,55 @@ def __init__(
            (n_ctx, self._n_vocab), dtype=np.single
        )

-        self._mirostat_mu = ctypes.c_float(2.0 * 5.0) # TODO: Move this to sampling context
+        self._mirostat_mu = ctypes.c_float(
+            2.0 * 5.0
+        ) # TODO: Move this to sampling context

        try:
            self.metadata = self._model.metadata()
        except Exception as e:
            self.metadata = {}
            if self.verbose:
                print(f"Failed to load metadata: {e}", file=sys.stderr)

        if self.verbose:
            print(f"Model metadata: {self.metadata}", file=sys.stderr)

+        if self.chat_format is None and self.chat_handler is None and "tokenizer.chat_template" in self.metadata:
+            chat_format = llama_chat_format.guess_chat_format_from_gguf_metadata(self.metadata)
+
+            if chat_format is not None:
+                self.chat_format = chat_format
+                if self.verbose:
+                    print(f"Guessed chat format: {chat_format}", file=sys.stderr)
+            else:
+                template = self.metadata["tokenizer.chat_template"]
+                try:
+                    eos_token_id = int(self.metadata["tokenizer.ggml.eos_token_id"])
+                except:
+                    eos_token_id = self.token_eos()
+                try:
+                    bos_token_id = int(self.metadata["tokenizer.ggml.bos_token_id"])
+                except:
+                    bos_token_id = self.token_bos()
+
+                eos_token = self.detokenize([eos_token_id]).decode("utf-8")
+                bos_token = self.detokenize([bos_token_id]).decode("utf-8")
+
+                if self.verbose:
+                    print(f"Using chat template: {template}", file=sys.stderr)
+                    print(f"Using chat eos_token: {eos_token}", file=sys.stderr)
+                    print(f"Using chat bos_token: {bos_token}", file=sys.stderr)
+
+                self.chat_handler = llama_chat_format.Jinja2ChatFormatter(
+                    template=template,
+                    eos_token=eos_token,
+                    bos_token=bos_token
+                ).to_chat_handler()
+
+        if self.chat_format is None and self.chat_handler is None:
+            self.chat_format = "llama-2"
+
    @property
    def ctx(self) -> llama_cpp.llama_context_p:
        assert self._ctx.ctx is not None
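A minimal sketch of the new default behavior, assuming a local GGUF model path: with `chat_format` left at its new default of `None`, the constructor inspects the GGUF metadata, tries to guess a known chat format, falls back to a `Jinja2ChatFormatter` built from the embedded `tokenizer.chat_template`, and only uses `"llama-2"` when neither is available. An explicit `chat_format` still takes precedence.

```python
from llama_cpp import Llama

# chat_format defaults to None, so the format is inferred from the model's GGUF metadata.
# With verbose=True this prints either "Guessed chat format: ..." or the template/eos/bos in use.
llm = Llama(model_path="./models/7B/llama-model.gguf", verbose=True)

# Passing chat_format explicitly overrides anything guessed from the metadata.
llm_chatml = Llama(model_path="./models/7B/llama-model.gguf", chat_format="chatml")
```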
@@ -550,7 +587,7 @@ def sample(
                candidates=self._candidates,
                tau=mirostat_tau,
                eta=mirostat_eta,
-                mu=ctypes.pointer(self._mirostat_mu)
+                mu=ctypes.pointer(self._mirostat_mu),
            )
        else:
            self._ctx.sample_top_k(candidates=self._candidates, k=top_k, min_keep=1)

0 commit comments