Requests getting lost internal to Kserve #4884

@darwich6

Description

/kind bug

What steps did you take and what happened:
[A clear and concise description of what the bug is.]

Hi all, hope it's going well. We are running KServe v0.15 and serving several large LLMs on vLLM and SGLang. We're seeing fairly persistent TTFT spikes on a few different InferenceServices, and one particular user cancels streaming requests whenever TTFT exceeds 10 seconds. Requests are inherently small (max_tokens = 1000), and autoscaling is effectively disabled with minScale = maxScale = 32. The vLLM instances themselves look underutilized, averaging about 5-7 running requests per instance, yet the queue depth for this revision is 300+.
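
To put numbers on it: with containerConcurrency: 128 across 32 replicas, the Knative queue-proxies can collectively admit roughly 4,096 in-flight requests (32 × 128) before they start queueing per pod, and each vLLM server is capped at --max-num-seqs 256 but only running 5-7 requests. Idle backends plus a 300+ revision queue therefore points at requests being buffered in the Knative data path (activator or queue-proxy) rather than inside vLLM. One experiment we are considering, purely as a sketch and with the annotation value below being our own assumption rather than something we have validated, is to take the activator out of the request path for this revision and see whether the queue drains:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    # Diagnostic only: a target-burst-capacity of "0" asks Knative to keep the
    # activator out of the data path while the revision has ready capacity, so
    # requests flow straight to the queue-proxy/pod instead of being buffered
    # upstream. Annotation name per the Knative autoscaling docs; worth
    # verifying against Knative 1.19 before relying on it.
    autoscaling.knative.dev/target-burst-capacity: "0"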

What did you expect to happen:

We expected the pool of instances (or a scale-up) to absorb the load so that no requests are lost.
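
Worth noting against that expectation: minReplicas and maxReplicas are both 32 in the spec below, so the Knative autoscaler has no headroom to add pods; any backlog has to be absorbed by the fixed pool. For scale-up to actually kick in, the replica bounds would need to differ, roughly along these lines (the maxReplicas value is an illustrative assumption, not what we run today):

spec:
  predictor:
    minReplicas: 32
    # Illustrative assumption: any value above minReplicas gives the Knative
    # autoscaler room to add pods when per-pod concurrency exceeds the target.
    maxReplicas: 48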

What's the InferenceService yaml:
[To help us debug please run kubectl get isvc $name -n $namespace -oyaml and paste the output]


apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    autoscaling.knative.dev/scaleDownDelay: 180s
    autoscaling.knative.dev/scaleToZeroGracePeriod: 30s
    autoscaling.knative.dev/scaleUpDelay: 30s
    autoscaling.knative.dev/target: "80"
    prometheus.io/path: /metrics
    prometheus.io/port: "8000"
    prometheus.io/scrape: "true"
    serving.knative.dev/reconcile: "1762628506"
    serving.kserve.io/enable-prometheus-scraping: "true"

...
    containerConcurrency: 128
    containers:
    - args:
      - --model
      - 
      - --tensor-parallel-size
      - "1"
      - --max_num_batched_tokens
      - "65536"
      - --max-model-len
      - "32768"
      - --enable-chunked-prefill
      - --max-num-seqs
      - "256"
      - --num-lookahead-slots
      - "4"
      - --uvicorn-log-level
      - info
      - --speculative-config
      - '{"method":"ngram", "num_speculative_tokens":12, "prompt_lookup_min":4, "prompt_lookup_max":5}'
      command:
      - python3
      - -m
      - vllm.entrypoints.openai.api_server
      image: vllm/vllm-openai@sha256:014a95f21c9edf6abe0aea6b07353f96baa4ec291c427bb1176dc7c93a85845c
      name: kserve-container
      ports:
      - containerPort: 8000
        protocol: TCP
      resources:
        limits:
          cpu: 14375m
          memory: 175G
          nvidia.com/gpu: "1"
        requests:
          cpu: 14375m
          memory: 175G
          nvidia.com/gpu: "1"
      volumeMounts:
      - mountPath: /dev/shm
        name: dshm
    maxReplicas: 32
    minReplicas: 32
    timeout: 300
    volumes:
    - emptyDir:
        medium: Memory
        sizeLimit: 32Gi
      name: dshm

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

I suspect it's a misconfiguration between containerConcurrency and what the container can actually handle concurrently. Any insight would be helpful!
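
If that mismatch is the cause, one experiment (a sketch only; the value 8 below is our assumption based on the observed 5-7 running requests per instance, not a recommendation) would be to lower containerConcurrency to roughly what a single replica actually serves concurrently, so each queue-proxy stops admitting far more requests than its backend will start streaming promptly:

spec:
  predictor:
    minReplicas: 32
    maxReplicas: 32
    # Hypothetical value: close to the ~5-7 concurrent requests we actually
    # see per instance, with a little headroom. The current 128 lets each
    # queue-proxy hold many requests that the vLLM container never starts.
    containerConcurrency: 8
    timeout: 300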

Environment:

  • Istio Version: 1.26.2
  • Knative Version: 1.19.6
  • KServe Version: v0.15.2
  • Kubeflow version: N/A
  • Cloud Environment: k8s_istio
  • Minikube/Kind version: N/A
  • Kubernetes version (kubectl version): v1.32.8+k3s1 (server), v1.33.2 (client)
  • OS (e.g. from /etc/os-release): Ubuntu 24.04 LTS
