System Info
I was using v2.3.1 via Docker and everything was working. After updating to later versions, including the latest, TGI no longer starts and fails with the following error:
2024-12-12T14:26:52.973549Z INFO hf_hub: Token file not found "/data/token"
2024-12-12T14:26:54.846408Z INFO text_generation_launcher: Forcing attention to 'flashdecoding' because head dim is not supported by flashinfer, also disabling prefix caching
2024-12-12T14:26:54.846426Z INFO text_generation_launcher: Using attention flashdecoding - Prefix caching 0
2024-12-12T14:26:54.846433Z INFO text_generation_launcher: Sharding model on 2 processes
2024-12-12T14:26:54.931439Z INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 30821
2024-12-12T14:26:54.931470Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-12-12T14:26:54.931727Z INFO download: text_generation_launcher: Starting check and download process for microsoft/Phi-3.5-mini-instruct
2024-12-12T14:26:57.914690Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-12-12T14:26:58.250499Z INFO download: text_generation_launcher: Successfully downloaded weights for microsoft/Phi-3.5-mini-instruct
2024-12-12T14:26:58.251011Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-12-12T14:26:58.251055Z INFO shard-manager: text_generation_launcher: Starting shard rank=1
2024-12-12T14:27:00.870304Z INFO text_generation_launcher: Using prefix caching = False
2024-12-12T14:27:00.870362Z INFO text_generation_launcher: Using Attention = flashdecoding
2024-12-12T14:27:06.425419Z INFO text_generation_launcher: Using prefill chunking = True
2024-12-12T14:27:06.535239Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-12-12T14:27:06.536669Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-1
2024-12-12T14:27:06.572585Z INFO shard-manager: text_generation_launcher: Shard ready in 8.307980962s rank=0
2024-12-12T14:27:06.578046Z INFO shard-manager: text_generation_launcher: Shard ready in 8.308372036s rank=1
2024-12-12T14:27:06.657793Z INFO text_generation_launcher: Starting Webserver
2024-12-12T14:27:06.739409Z INFO text_generation_router_v3: backends/v3/src/lib.rs:125: Warming up model
2024-12-12T14:27:06.863722Z INFO text_generation_launcher: Using optimized Triton indexing kernels.
2024-12-12T14:27:07.034243Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 321, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 728, in main
return _main(
File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 197, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 703, in wrapper
return callback(**use_params)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/cli.py", line 117, in serve
server.serve(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 315, in serve
asyncio.run(
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
self._run_once()
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
handle._run()
File "/opt/conda/lib/python3.11/asyncio/events.py", line 84, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
> File "/opt/conda/lib/python3.11/site-packages/text_generation_server/interceptor.py", line 24, in intercept
return await response
File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 132, in Warmup
batch = self.model.batch_type.from_pb(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 495, in from_pb
return cls.from_tokenized(pb, tokenizer, batch_tokenized_inputs, dtype, device)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 427, in from_tokenized
block_tables_to_padded(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/metadata_kernels.py", line 42, in block_tables_to_padded
triton_block_tables_to_padded[grid](
File "/opt/conda/lib/python3.11/site-packages/triton/runtime/jit.py", line 345, in <lambda>
return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/triton/runtime/jit.py", line 607, in run
device = driver.active.get_current_device()
File "/opt/conda/lib/python3.11/site-packages/triton/runtime/driver.py", line 23, in __getattr__
self._initialize_obj()
File "/opt/conda/lib/python3.11/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
self._obj = self._init_fn()
File "/opt/conda/lib/python3.11/site-packages/triton/runtime/driver.py", line 9, in _create_driver
return actives[0]()
File "/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/driver.py", line 371, in __init__
self.utils = CudaUtils() # TODO: make static
File "/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/driver.py", line 80, in __init__
mod = compile_module_from_src(Path(os.path.join(dirname, "driver.c")).read_text(), "cuda_utils")
File "/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/driver.py", line 57, in compile_module_from_src
so = _build(name, src_path, tmpdir, library_dirs(), include_dir, libraries)
File "/opt/conda/lib/python3.11/site-packages/triton/runtime/build.py", line 48, in _build
ret = subprocess.check_call(cc_cmd)
File "/opt/conda/lib/python3.11/subprocess.py", line 413, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpx2wgfsg0/main.c', '-O3', '-shared', '-fPIC', '-o', '/tmp/tmpx2wgfsg0/cuda_utils.cpython-311-x86_64-linux-gnu.so', '-lcuda', '-L/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/lib', '-L/usr/lib64', '-I/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/include', '-I/tmp/tmpx2wgfsg0', '-I/opt/conda/include/python3.11']' returned non-zero exit status 1.
2024-12-12T14:27:07.034772Z ERROR warmup{max_input_length=None max_prefill_tokens=30821 max_total_tokens=None max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: Command '['/usr/bin/gcc', '/tmp/tmpx2wgfsg0/main.c', '-O3', '-shared', '-fPIC', '-o', '/tmp/tmpx2wgfsg0/cuda_utils.cpython-311-x86_64-linux-gnu.so', '-lcuda', '-L/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/lib', '-L/usr/lib64', '-I/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/include', '-I/tmp/tmpx2wgfsg0', '-I/opt/conda/include/python3.11']' returned non-zero exit status 1.
2024-12-12T14:27:07.078269Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/py
57AE
thon3.11/site-packages/typer/main.py", line 321, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 728, in main
return _main(
File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 197, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 703, in wrapper
return callback(**use_params)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/cli.py", line 117, in serve
server.serve(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 315, in serve
asyncio.run(
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
self._run_once()
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
handle._run()
File "/opt/conda/lib/python3.11/asyncio/events.py", line 84, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
> File "/opt/conda/lib/python3.11/site-packages/text_generation_server/interceptor.py", line 24, in intercept
return await response
File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 132, in Warmup
batch = self.model.batch_type.from_pb(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 495, in from_pb
return cls.from_tokenized(pb, tokenizer, batch_tokenized_inputs, dtype, device)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 427, in from_tokenized
block_tables_to_padded(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/metadata_kernels.py", line 42, in block_tables_to_padded
triton_block_tables_to_padded[grid](
File "/opt/conda/lib/python3.11/site-packages/triton/runtime/jit.py", line 345, in <lambda>
return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/triton/runtime/jit.py", line 607, in run
device = driver.active.get_current_device()
File "/opt/conda/lib/python3.11/site-packages/triton/runtime/driver.py", line 23, in __getattr__
self._initialize_obj()
File "/opt/conda/lib/python3.11/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
self._obj = self._init_fn()
File "/opt/conda/lib/python3.11/site-packages/triton/runtime/driver.py", line 9, in _create_driver
return actives[0]()
File "/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/driver.py", line 371, in __init__
self.utils = CudaUtils() # TODO: make static
File "/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/driver.py", line 80, in __init__
mod = compile_module_from_src(Path(os.path.join(dirname, "driver.c")).read_text(), "cuda_utils")
File "/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/driver.py", line 57, in compile_module_from_src
so = _build(name, src_path, tmpdir, library_dirs(), include_dir, libraries)
File "/opt/conda/lib/python3.11/site-packages/triton/runtime/build.py", line 48, in _build
ret = subprocess.check_call(cc_cmd)
File "/opt/conda/lib/python3.11/subprocess.py", line 413, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmp6j5j7_4h/main.c', '-O3', '-shared', '-fPIC', '-o', '/tmp/tmp6j5j7_4h/cuda_utils.cpython-311-x86_64-linux-gnu.so', '-lcuda', '-L/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/lib', '-L/usr/lib64', '-I/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/include', '-I/tmp/tmp6j5j7_4h', '-I/opt/conda/include/python3.11']' returned non-zero exit status 1.
2024-12-12T14:27:07.078823Z ERROR warmup{max_input_length=None max_prefill_tokens=30821 max_total_tokens=None max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: Command '['/usr/bin/gcc', '/tmp/tmp6j5j7_4h/main.c', '-O3', '-shared', '-fPIC', '-o', '/tmp/tmp6j5j7_4h/cuda_utils.cpython-311-x86_64-linux-gnu.so', '-lcuda', '-L/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/lib', '-L/usr/lib64', '-I/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/include', '-I/tmp/tmp6j5j7_4h', '-I/opt/conda/include/python3.11']' returned non-zero exit status 1.
Error: Backend(Warmup(Generation("Command '['/usr/bin/gcc', '/tmp/tmp6j5j7_4h/main.c', '-O3', '-shared', '-fPIC', '-o', '/tmp/tmp6j5j7_4h/cuda_utils.cpython-311-x86_64-linux-gnu.so', '-lcuda', '-L/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/lib', '-L/usr/lib64', '-I/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/include', '-I/tmp/tmp6j5j7_4h', '-I/opt/conda/include/python3.11']' returned non-zero exit status 1.")))
2024-12-12T14:27:07.117285Z ERROR text_generation_launcher: Webserver Crashed
2024-12-12T14:27:07.117316Z INFO text_generation_launcher: Shutting down shards
2024-12-12T14:27:07.173251Z INFO shard-manager: text_generation_launcher: Terminating shard rank=0
2024-12-12T14:27:07.173312Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=0
2024-12-12T14:27:07.178761Z INFO shard-manager: text_generation_launcher: Terminating shard rank=1
2024-12-12T14:27:07.178820Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=1
2024-12-12T14:27:08.279806Z INFO shard-manager: text_generation_launcher: shard terminated rank=1
Error: WebserverFailed
2024-12-12T14:27:08.474404Z INFO shard-manager: text_generation_launcher: shard terminated rank=0
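The launcher only surfaces gcc's exit status, not its stderr, so the actual compiler error is hidden. Below is a minimal sketch for reproducing just the failing link step, with the stable flags copied from the command in the traceback (assumptions: the image's entrypoint can be overridden and the same CDI GPU devices are injected as in my podman command further down):

```sh
# Hypothetical probe: re-run a trivial gcc link against libcuda with the same
# -l/-L/-I flags Triton used (the temporary -I/tmp/... dir from the log is
# omitted, since it no longer exists). The goal is to see gcc's real message.
podman run --rm --device nvidia.com/gpu=4 --device nvidia.com/gpu=5 \
  --entrypoint /bin/bash ghcr.io/huggingface/text-generation-inference:3.0.1 -c '
    echo "int main(void){return 0;}" > /tmp/probe.c
    /usr/bin/gcc /tmp/probe.c -O3 -shared -fPIC -o /tmp/probe.so \
      -lcuda \
      -L/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/lib \
      -L/usr/lib64 \
      -I/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/include \
      -I/opt/conda/include/python3.11 \
      && echo "probe link OK"
  '
```

If this probe fails with something like "cannot find -lcuda", my guess would be that the CUDA driver library is not visible at the paths Triton passes to gcc, rather than anything in TGI itself; that is only a guess, though.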
This is my nvidia-smi output:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L4 On | 00000000:0D:00.0 Off | 0 |
| N/A 54C P0 30W / 72W | 1557MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA L4 On | 00000000:37:00.0 Off | 0 |
| N/A 55C P0 28W / 72W | 21989MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA L4 On | 00000000:4A:00.0 Off | 0 |
| N/A 39C P0 27W / 72W | 21659MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA L4 On | 00000000:61:00.0 Off | 0 |
| N/A 37C P0 27W / 72W | 19965MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA L4 On | 00000000:A0:00.0 Off | 0 |
| N/A 46C P8 17W / 72W | 4MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA L4 On | 00000000:B5:00.0 Off | 0 |
| N/A 48C P0 22W / 72W | 193MiB / 23034MiB | 1% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA L4 On | 00000000:CA:00.0 Off | 0 |
| N/A 28C P8 12W / 72W | 1MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA L4 On | 00000000:E1:00.0 Off | 0 |
| N/A 26C P8 12W / 72W | 1MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 137174 C /app/.venv/bin/python 1548MiB |
| 1 N/A N/A 13513 C /opt/conda/bin/python3.11 21980MiB |
| 2 N/A N/A 13518 C /opt/conda/bin/python3.11 21650MiB |
| 3 N/A N/A 13523 C /opt/conda/bin/python3.11 19956MiB |
| 5 N/A N/A 2150019 C /opt/conda/bin/python3.11 184MiB |
+-----------------------------------------------------------------------------------------+
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
Here is the TGI env:
{
model_id: "microsoft/Phi-3.5-mini-instruct",
revision: None,
validation_workers: 2,
sharded: None,
num_shard: Some(
2,
),
quantize: None,
speculate: None,
dtype: None,
kv_cache_dtype: None,
trust_remote_code: false,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: None,
max_input_length: None,
max_total_tokens: None,
waiting_served_ratio: 0.3,
max_batch_prefill_tokens: None,
max_batch_total_tokens: None,
max_waiting_tokens: 20,
max_batch_size: None,
cuda_graphs: None,
hostname: "06ee66ffa08d",
port: 3000,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: None,
weights_cache_override: None,
disable_custom_kernels: false,
cuda_memory_fraction: 1.0,
rope_scaling: None,
rope_factor: None,
json_output: false,
otlp_endpoint: None,
otlp_service_name: "text-generation-inference.router",
cors_allow_origin: [],
api_key: None,
watermark_gamma: None,
watermark_delta: None,
ngrok: false,
ngrok_authtoken: None,
ngrok_edge: None,
tokenizer_config_path: None,
disable_grammar_support: false,
env: false,
max_client_batch_size: 4,
lora_adapters: None,
usage_stats: On,
payload_limit: 2000000,
enable_prefill_logprobs: false,
}
And here is how I'm running the container (via podman):
podman create --name=tgi_container --security-opt label=disable --label io.podman.compose.config-hash=XXXXXXXX --label io.podman.compose.project=some-deployment --label io.podman.compose.version=1.0.6 --label PODMAN_SYSTEMD_UNIT=podman-compose@some-deployment.service --label com.docker.compose.project=some-deployment --label com.docker.compose.project.working_dir=/data/some-deployment --label com.docker.compose.project.config_files=docker-compose.yml --label com.docker.compose.container-number=1 --label com.docker.compose.service=tgi --device nvidia.com/gpu=4 --device nvidia.com/gpu=5 -e HUGGING_FACE_HUB_TOKEN=hf_XXXXXXX -e FLASH_DECODING=1 -e PREFILL_CHUNKING=1 -e NCCL_DEBUG=INFO -v /data/tgi/data:/data --net some-deployment_api --network-alias tgi --expose 3000 -p 3000:3000 --shm-size 10gb --restart on-failure ghcr.io/huggingface/text-generation-inference:3.0.1 --port 3000 --model-id microsoft/Phi-3.5-mini-instruct --num-shard 2
This command is generated on my system by podman-compose from a docker-compose file.
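For what it's worth, here is a quick sketch for checking whether the toolchain and libcuda are even visible inside a container started the same way; the device names and the -L/usr/lib64 path are taken from the command and traceback above, everything else is an assumption:

```sh
# Hypothetical sanity checks inside a fresh container from the same image,
# with the same two CDI GPU devices injected.
podman run --rm --device nvidia.com/gpu=4 --device nvidia.com/gpu=5 \
  --entrypoint /bin/bash ghcr.io/huggingface/text-generation-inference:3.0.1 -c '
    which gcc && gcc --version | head -n1                     # is the compiler there?
    ldconfig -p | grep -i "libcuda\.so" \
      || echo "libcuda.so not in the linker cache"            # what -lcuda must resolve
    ls /usr/lib64 2>/dev/null || echo "/usr/lib64 is absent"  # -L path from the gcc command
  '
```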
Expected behavior
The TGI server should start correctly and normally, as it did before the Triton indexing kernels were added!