llama_cpp/server/app.py: 71 additions & 54 deletions
@@ -70,6 +70,55 @@ def get_llama():
     description="The model to use for generating completions."
 )
 
+max_tokens_field = Field(
+    default=16,
+    ge=1,
+    le=2048,
+    description="The maximum number of tokens to generate."
+)
+
+temperature_field = Field(
+    default=0.8,
+    ge=0.0,
+    le=2.0,
+    description="Adjust the randomness of the generated text.\n\n" +
+    "Temperature is a hyperparameter that controls the randomness of the generated text. It affects the probability distribution of the model's output tokens. A higher temperature (e.g., 1.5) makes the output more random and creative, while a lower temperature (e.g., 0.5) makes the output more focused, deterministic, and conservative. The default value is 0.8, which provides a balance between randomness and determinism. At the extreme, a temperature of 0 will always pick the most likely next token, leading to identical outputs in each run."
+)
+
+top_p_field = Field(
+    default=0.95,
+    ge=0.0,
+    le=1.0,
+    description="Limit the next token selection to a subset of tokens with a cumulative probability above a threshold P.\n\n" +
+    "Top-p sampling, also known as nucleus sampling, is another text generation method that selects the next token from a subset of tokens that together have a cumulative probability of at least p. This method provides a balance between diversity and quality by considering both the probabilities of tokens and the number of tokens to sample from. A higher value for top_p (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text."
+)
+
+stop_field = Field(
+    default=None,
+    description="A list of tokens at which to stop generation. If None, no stop tokens are used."
+)
+
+stream_field = Field(
+    default=False,
+    description="Whether to stream the results as they are generated. Useful for chatbots."
+)
+
+top_k_field = Field(
+    default=40,
+    ge=0,
+    description="Limit the next token selection to the K most probable tokens.\n\n" +
+    "Top-k sampling is a text generation method that selects the next token only from the top k most likely tokens predicted by the model. It helps reduce the risk of generating low-probability or nonsensical tokens, but it may also limit the diversity of the output. A higher value for top_k (e.g., 100) will consider more tokens and lead to more diverse text, while a lower value (e.g., 10) will focus on the most probable tokens and generate more conservative text."
+)
+
+repeat_penalty_field = Field(
+    default=1.0,
+    ge=0.0,
+    description="A penalty applied to each token that is already generated. This helps prevent the model from repeating itself.\n\n" +
+    "Repeat penalty is a hyperparameter used to penalize the repetition of token sequences during text generation. It helps prevent the model from generating repetitive or monotonous text. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient."
+)
+
+
+
 class CreateCompletionRequest(BaseModel):
     prompt: Union[str, List[str]] = Field(
         default="",
@@ -79,62 +128,27 @@ class CreateCompletionRequest(BaseModel):
         default=None,
         description="A suffix to append to the generated text. If None, no suffix is appended. Useful for chatbots."
     )
-    max_tokens: int = Field(
-        default=16,
-        ge=1,
-        le=2048,
-        description="The maximum number of tokens to generate."
-    )
-    temperature: float = Field(
-        default=0.8,
-        ge=0.0,
-        le=2.0,
-        description="Adjust the randomness of the generated text.\n\n" +
-        "Temperature is a hyperparameter that controls the randomness of the generated text. It affects the probability distribution of the model's output tokens. A higher temperature (e.g., 1.5) makes the output more random and creative, while a lower temperature (e.g., 0.5) makes the output more focused, deterministic, and conservative. The default value is 0.8, which provides a balance between randomness and determinism. At the extreme, a temperature of 0 will always pick the most likely next token, leading to identical outputs in each run."
-    )
-    top_p: float = Field(
-        default=0.95,
-        ge=0.0,
-        le=1.0,
-        description="Limit the next token selection to a subset of tokens with a cumulative probability above a threshold P.\n\n" +
-        "Top-p sampling, also known as nucleus sampling, is another text generation method that selects the next token from a subset of tokens that together have a cumulative probability of at least p. This method provides a balance between diversity and quality by considering both the probabilities of tokens and the number of tokens to sample from. A higher value for top_p (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text."
-    )
+    max_tokens: int = max_tokens_field
+    temperature: float = temperature_field
+    top_p: float = top_p_field
     echo: bool = Field(
         default=False,
         description="Whether to echo the prompt in the generated text. Useful for chatbots."
     )
-    stop: Optional[List[str]] = Field(
-        default=None,
-        description="A list of tokens at which to stop generation. If None, no stop tokens are used."
-    )
-    stream: bool = Field(
-        default=False,
-        description="Whether to stream the results as they are generated. Useful for chatbots."
-    )
+    stop: Optional[List[str]] = stop_field
+    stream: bool = stream_field
     logprobs: Optional[int] = Field(
         default=None,
         ge=0,
         description="The number of logprobs to generate. If None, no logprobs are generated."
     )
 
-
-
     # ignored, but marked as required for the sake of compatibility with openai's api
     model: str = model_field
 
     # llama.cpp specific parameters
-    top_k: int = Field(
-        default=40,
-        ge=0,
-        description="Limit the next token selection to the K most probable tokens.\n\n" +
-        "Top-k sampling is a text generation method that selects the next token only from the top k most likely tokens predicted by the model. It helps reduce the risk of generating low-probability or nonsensical tokens, but it may also limit the diversity of the output. A higher value for top_k (e.g., 100) will consider more tokens and lead to more diverse text, while a lower value (e.g., 10) will focus on the most probable tokens and generate more conservative text."
-    )
-    repeat_penalty: float = Field(
-        default=1.0,
-        ge=0.0,
-        description="A penalty applied to each token that is already generated. This helps prevent the model from repeating itself.\n\n" +
-        "Repeat penalty is a hyperparameter used to penalize the repetition of token sequences during text generation. It helps prevent the model from generating repetitive or monotonous text. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient."