/kind bug
What steps did you take and what happened:
[A clear and concise description of what the bug is.]
Hi all, hope it's going well. We are on KServe v0.15, serving several large LLMs with vLLM and SGLang. We're seeing fairly persistent TTFT (time-to-first-token) spikes on a few different ISVCs, and one particular client cancels streaming requests when TTFT > 10 seconds. Requests are inherently small (max_tokens = 1000), and autoscaling is effectively disabled with minScale = maxScale = 32. The vLLM instances themselves look underutilized, averaging about 5-7 running requests per instance, yet the queue depth for this revision is 300+.
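For anyone trying to reproduce these numbers, a rough way to spot-check per-pod queue depth and the autoscaler's view of scale is below. The namespace/revision names are placeholders, and the queue-proxy metric names and port are the documented Knative Serving ones, which may differ across versions:

# Placeholders: substitute your own namespace and Knative revision name.
NS=my-namespace
REV=my-isvc-predictor-00001

# Desired vs. actual scale as computed by the Knative autoscaler.
kubectl -n "$NS" get podautoscaler "$REV"

# Per-pod queue depth reported by the queue-proxy sidecar (Prometheus metrics
# on port 9091; metric names and port may vary by Knative version).
POD=$(kubectl -n "$NS" get pods -l serving.knative.dev/revision="$REV" \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n "$NS" port-forward "$POD" 9091:9091 &
curl -s http://localhost:9091/metrics | grep -E 'revision_queue_depth|revision_app_request_count'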
What did you expect to happen:
We expected the scaled-out instances (pinned at 32 replicas) to absorb the load and handle any requests that would otherwise be lost.
What's the InferenceService yaml:
[To help us debug please run kubectl get isvc $name -n $namespace -oyaml and paste the output]
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    autoscaling.knative.dev/scaleDownDelay: 180s
    autoscaling.knative.dev/scaleToZeroGracePeriod: 30s
    autoscaling.knative.dev/scaleUpDelay: 30s
    autoscaling.knative.dev/target: "80"
    prometheus.io/path: /metrics
    prometheus.io/port: "8000"
    prometheus.io/scrape: "true"
    serving.knative.dev/reconcile: "1762628506"
    serving.kserve.io/enable-prometheus-scraping: "true"
...
    containerConcurrency: 128
    containers:
    - args:
      - --model
      -
      - --tensor-parallel-size
      - "1"
      - --max_num_batched_tokens
      - "65536"
      - --max-model-len
      - "32768"
      - --enable-chunked-prefill
      - --max-num-seqs
      - "256"
      - --num-lookahead-slots
      - "4"
      - --uvicorn-log-level
      - info
      - --speculative-config
      - '{"method":"ngram", "num_speculative_tokens":12, "prompt_lookup_min":4, "prompt_lookup_max":5}'
      command:
      - python3
      - -m
      - vllm.entrypoints.openai.api_server
      image: vllm/vllm-openai@sha256:014a95f21c9edf6abe0aea6b07353f96baa4ec291c427bb1176dc7c93a85845c
      name: kserve-container
      ports:
      - containerPort: 8000
        protocol: TCP
      resources:
        limits:
          cpu: 14375m
          memory: 175G
          nvidia.com/gpu: "1"
        requests:
          cpu: 14375m
          memory: 175G
          nvidia.com/gpu: "1"
      volumeMounts:
      - mountPath: /dev/shm
        name: dshm
    maxReplicas: 32
    minReplicas: 32
    timeout: 300
    volumes:
    - emptyDir:
        medium: Memory
        sizeLimit: 32Gi
      name: dshm
Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]
I suspect it's a misconfiguration between Knative's containerConcurrency and the container's actual serving concurrency (vLLM's own limits); a rough sketch of what I mean is below. Any insight would be helpful!
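To make that concrete, here is a minimal, illustrative sketch of the knobs that would have to agree. The annotation and field names are the standard Knative/KServe ones; the numbers are placeholders, not a tested recommendation:

# Illustrative only: the per-pod concurrency limits on the Knative side and
# vLLM's own batching limit need to be chosen together, otherwise one side
# queues requests while the other sits idle.
metadata:
  annotations:
    # Soft per-pod concurrency target the autoscaler aims for.
    autoscaling.knative.dev/metric: concurrency
    autoscaling.knative.dev/target: "80"
spec:
  predictor:
    # Hard per-pod cap enforced by the queue-proxy; requests beyond this
    # queue outside the pod instead of reaching vLLM.
    containerConcurrency: 128
    containers:
    - args:
      # vLLM's cap on concurrently scheduled sequences; value picked here
      # only to match containerConcurrency above.
      - --max-num-seqs
      - "128"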
Environment:
- Istio Version: 1.26.2
- Knative Version: 1.19.6
- KServe Version: v0.15.2
- Kubeflow version: N/A
- Cloud Environment: k8s_istio
- Minikube/Kind version: N/A
- Kubernetes version (use kubectl version): v1.32.8+k3s1 (server), v1.33.2 (kubectl client)
- OS (e.g. from /etc/os-release): Ubuntu 24.04 LTS