Description
Prerequisites
Please answer the following questions for yourself before submitting an issue.
- I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behavior
It is supposed to compute the perplexity over the input file, as in the original PR: #270
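For context, here is a minimal standalone sketch of the calculation I expect, assuming the usual definition used in #270: perplexity is the exponential of the mean negative log-likelihood of each token given its preceding context (the helper and names below are hypothetical, not llama.cpp code):

```cpp
// Hypothetical sketch only: perplexity = exp(mean negative log-likelihood).
#include <cmath>
#include <cstdio>
#include <vector>

double perplexity(const std::vector<double> & token_probs) {
    // token_probs[i] is p(token_i | preceding context), assumed > 0.
    double nll = 0.0;
    for (double p : token_probs) {
        nll += -std::log(p);
    }
    return std::exp(nll / token_probs.size());
}

int main() {
    // Toy example: a model that assigns probability 0.25 to every token
    // has perplexity 4.
    std::vector<double> probs(128, 0.25);
    std::printf("perplexity = %.2f\n", perplexity(probs));
    return 0;
}
```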
Current Behavior
However, it fails with the following exception:
llama_tokenize: too many tokens
libc++abi: terminating with uncaught exception of type std::length_error: vector
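For what it's worth, the std::length_error suggests a std::vector is being asked to grow past its max_size(), which is what happens when, for example, a negative or otherwise bogus element count is converted to a huge size_t. A minimal standalone illustration of that failure mode (my assumption about the cause, not llama.cpp code):

```cpp
// Standalone illustration only: resizing a vector to an impossibly large
// size (e.g. a negative count converted to size_t) throws std::length_error,
// which libc++ reports with the message "vector".
#include <cstdio>
#include <stdexcept>
#include <vector>

int main() {
    int n_tokens = -1;  // bogus count, e.g. from an error path
    std::vector<int> tokens;
    try {
        tokens.resize(static_cast<size_t>(n_tokens));  // ~1.8e19 elements
    } catch (const std::length_error & e) {
        std::printf("caught std::length_error: %s\n", e.what());
    }
    return 0;
}
```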
Environment and Context
- macOS (M2 Max)
- python3 --version: 3.8.16
- make --version: i386-apple-darwin11.3.0
- g++ --version: arm64-apple-darwin22.3.0
Steps to Reproduce
Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.
- git pull
- make
- python3 convert-pth-to-ggml.py models/7B/ 1
- python3 quantize.py 7B
- ./main -m ./models/7B/ggml-model-q4_0.bin -t 4 -n 128 --perplexity -f ~/wikitext-2-raw/wiki.test.raw
Failure Logs
llama.cpp % ./main -m ./models/7B/ggml-model-q4_0.bin -t 4 -n 128 --perplexity -f ~/wikitext-2-raw/wiki.test.raw
main: seed = 1679472306
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size = 512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from './models/7B/ggml-model-q4_0.bin'
llama_model_load: .................................... done
llama_model_load: model size = 4017.27 MB / num tensors = 291
system_info: n_threads = 4 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
llama_tokenize: too many tokens
libc++abi: terminating with uncaught exception of type std::length_error: vector
zsh: abort ./main -m ./models/7B/ggml-model-q4_0.bin -t 4 -n 128 --perplexity -f