/kind bug
What steps did you take and what happened:
[A clear and concise description of what the bug is.]
Hi all, hope it's going well. We are on KServe v0.15, serving several large LLMs with vLLM and SGLang. We're seeing fairly persistent TTFT (time-to-first-token) spikes on a few different ISVCs, and one particular client cancels streaming requests when TTFT > 10 seconds. Requests are inherently small (max_tokens = 1000), and autoscaling is effectively disabled with minScale = maxScale = 32. The vLLM instances themselves look underutilized, averaging about 5-7 running requests per instance, yet the queue depth for this revision is 300+.
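For anyone trying to reproduce these numbers, a rough way to spot-check per-pod queue depth and the autoscaler's view of scale is below. The namespace/revision names are placeholders, and the queue-proxy metric names and port are the documented Knative Serving ones, which may differ across versions:

# Placeholders: substitute your own namespace and Knative revision name.
NS=my-namespace
REV=my-isvc-predictor-00001

# Desired vs. actual scale as computed by the Knative autoscaler.
kubectl -n "$NS" get podautoscaler "$REV"

# Per-pod queue depth reported by the queue-proxy sidecar (Prometheus metrics
# on port 9091; metric names and port may vary by Knative version).
POD=$(kubectl -n "$NS" get pods -l serving.knative.dev/revision="$REV" \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n "$NS" port-forward "$POD" 9091:9091 &
curl -s http://localhost:9091/metrics | grep -E 'revision_queue_depth|revision_app_request_count'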
What did you expect to happen:
We expected the scaled-out instances (pinned at 32 replicas) to absorb the load and handle any requests that would otherwise be lost.
What's the InferenceService yaml:
[To help us debug please run kubectl get isvc $name -n $namespace -oyaml and paste the output]
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    autoscaling.knative.dev/scaleDownDelay: 180s
    autoscaling.knative.dev/scaleToZeroGracePeriod: 30s
    autoscaling.knative.dev/scaleUpDelay: 30s
    autoscaling.knative.dev/target: "80"
    prometheus.io/path: /metrics
    prometheus.io/port: "8000"
    prometheus.io/scrape: "true"
    serving.knative.dev/reconcile: "1762628506"
    serving.kserve.io/enable-prometheus-scraping: "true"
...
    containerConcurrency: 128
    containers:
    - args:
      - --model
      -
      - --tensor-parallel-size
      - "1"
      - --max_num_batched_tokens
      - "65536"
      - --max-model-len
      - "32768"
      - --enable-chunked-prefill
      - --max-num-seqs
      - "256"
      - --num-lookahead-slots
      - "4"
      - --uvicorn-log-level
      - info
      - --speculative-config
      - '{"method":"ngram", "num_speculative_tokens":12, "prompt_lookup_min":4, "prompt_lookup_max":5}'
      command:
      - python3
      - -m
      - vllm.entrypoints.openai.api_server
      image: vllm/vllm-openai@sha256:014a95f21c9edf6abe0aea6b07353f96baa4ec291c427bb1176dc7c93a85845c
      name: kserve-container
      ports:
      - containerPort: 8000
        protocol: TCP
      resources:
        limits:
          cpu: 14375m
          memory: 175G
          nvidia.com/gpu: "1"
        requests:
          cpu: 14375m
          memory: 175G
          nvidia.com/gpu: "1"
      volumeMounts:
      - mountPath: /dev/shm
        name: dshm
    maxReplicas: 32
    minReplicas: 32
    timeout: 300
    volumes:
    - emptyDir:
        medium: Memory
        sizeLimit: 32Gi
      name: dshm
Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]
I suspect it's a misconfiguration between Knative's containerConcurrency and the container's actual serving concurrency (vLLM's own limits); a rough sketch of what I mean is below. Any insight would be helpful!
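To make that concrete, here is a minimal, illustrative sketch of the knobs that would have to agree. The annotation and field names are the standard Knative/KServe ones; the numbers are placeholders, not a tested recommendation:

# Illustrative only: the per-pod concurrency limits on the Knative side and
# vLLM's own batching limit need to be chosen together, otherwise one side
# queues requests while the other sits idle.
metadata:
  annotations:
    # Soft per-pod concurrency target the autoscaler aims for.
    autoscaling.knative.dev/metric: concurrency
    autoscaling.knative.dev/target: "80"
spec:
  predictor:
    # Hard per-pod cap enforced by the queue-proxy; requests beyond this
    # queue outside the pod instead of reaching vLLM.
    containerConcurrency: 128
    containers:
    - args:
      # vLLM's cap on concurrently scheduled sequences; value picked here
      # only to match containerConcurrency above.
      - --max-num-seqs
      - "128"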
Environment:
- Istio Version: 1.26.2
- Knative Version: 1.19.6
- KServe Version: v0.15.2
- Kubeflow version: N/A
- Cloud Environment: k8s_istio
- Minikube/Kind version: N/A
- Kubernetes version (use kubectl version): v1.32.8+k3s1 (server), v1.33.2 (kubectl client)
- OS (e.g. from /etc/os-release): Ubuntu 24.04 LTS