vLLM

Configure vLLM, a high-performance LLM serving engine, through agentgateway. This guide covers two deployment patterns.

  • External vLLM: Connect to a vLLM server running outside your Kubernetes cluster on dedicated GPU hardware.
  • In-cluster vLLM: Deploy vLLM as a workload inside your Kubernetes cluster.

Before you begin

Install and set up an agentgateway proxy.

Set up vLLM

Choose your deployment option and follow the corresponding steps to set up the vLLM server and create the required Kubernetes resources.

Option 1: External vLLM

  1. Install vLLM on a GPU-enabled machine. See the vLLM installation guide.

  2. Start the vLLM OpenAI-compatible server.

    vllm serve meta-llama/Llama-3.1-8B-Instruct \
      --host 0.0.0.0 \
      --port 8000 \
      --dtype auto
  3. Verify the server is accessible.

    curl http://<VLLM_SERVER_IP>:8000/v1/models
  4. Create a headless Service and EndpointSlice that point to the external vLLM server. Replace <VLLM_SERVER_IP> with the actual IP address.

    kubectl apply -f- <<EOF
    apiVersion: v1
    kind: Service
    metadata:
      name: vllm
      namespace: agentgateway-system
    spec:
      type: ClusterIP
      clusterIP: None
      ports:
      - port: 8000
        targetPort: 8000
        protocol: TCP
    ---
    apiVersion: discovery.k8s.io/v1
    kind: EndpointSlice
    metadata:
      name: vllm
      namespace: agentgateway-system
      labels:
        kubernetes.io/service-name: vllm
    addressType: IPv4
    endpoints:
    - addresses:
      - <VLLM_SERVER_IP>
    ports:
    - port: 8000
      protocol: TCP
    EOF
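To confirm the headless Service and EndpointSlice are wired correctly, you can check that the Service name resolves to the external IP from inside the cluster. This is an optional sketch that requires a running cluster; the busybox image is an assumption, and any image with nslookup works.

```shell
# Resolve the headless Service from a throwaway pod; it should return <VLLM_SERVER_IP>.
kubectl run dns-check --rm -it --restart=Never \
  --image=busybox:1.36 -n agentgateway-system -- \
  nslookup vllm.agentgateway-system.svc.cluster.local
```

Because the Service is headless, DNS returns the endpoint address directly rather than a cluster IP.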

Option 2: Deploy vLLM in a Kubernetes cluster

Use this option to deploy vLLM directly in your Kubernetes cluster alongside agentgateway.

Before you begin, make sure that your cluster has:

  • GPU nodes (NVIDIA GPUs with CUDA support).
  • NVIDIA GPU Operator or device plugin installed.
  • Sufficient GPU memory for your chosen model.

Example steps:

  1. Create a vLLM Deployment with GPU resources.

    ℹ️

    For gated models such as Llama, create a Hugging Face token secret before deploying.

    kubectl create secret generic hf-token \
      -n agentgateway-system \
      --from-literal=token=<your-hf-token>
    kubectl apply -f- <<EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: vllm
      namespace: agentgateway-system
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: vllm
      template:
        metadata:
          labels:
            app: vllm
        spec:
          containers:
          - name: vllm
            image: vllm/vllm-openai:latest
            args:
            - "--model"
            - "meta-llama/Llama-3.1-8B-Instruct"
            - "--host"
            - "0.0.0.0"
            - "--port"
            - "8000"
            - "--dtype"
            - "auto"
            ports:
            - containerPort: 8000
              name: http
            resources:
              # Request one GPU so the pod is scheduled to a GPU node.
              # Requires the NVIDIA device plugin (see the prerequisites above).
              limits:
                nvidia.com/gpu: "1"
            env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
                  optional: true
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: vllm
      namespace: agentgateway-system
    spec:
      selector:
        app: vllm
      ports:
      - port: 8000
        targetPort: 8000
        protocol: TCP
    EOF
  2. Wait for the vLLM pod to be ready.

    ℹ️

    vLLM downloads model weights on first startup, which can take several minutes depending on model size and network speed. Monitor progress with the following command.

    kubectl logs -f deployment/vllm -n agentgateway-system
    kubectl wait --for=condition=ready pod \
      -l app=vllm \
      -n agentgateway-system \
      --timeout=300s
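Before wiring up the gateway, you can optionally confirm that vLLM itself is serving the model. This sketch assumes a live cluster and port-forwards directly to the vLLM Service.

```shell
# Forward the vLLM Service locally in the background.
kubectl port-forward -n agentgateway-system svc/vllm 8000:8000 &
PF_PID=$!
sleep 2

# The served model ID should appear in the list.
curl -s http://localhost:8000/v1/models | jq '.data[].id'

# Clean up the port-forward.
kill $PF_PID
```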

Create the agentgateway backend resources

These steps are the same for both external and in-cluster vLLM.

  1. Create an AgentgatewayBackend resource. The openai provider type is used because vLLM exposes an OpenAI-compatible API.

    kubectl apply -f- <<EOF
    apiVersion: agentgateway.dev/v1alpha1
    kind: AgentgatewayBackend
    metadata:
      name: vllm
      namespace: agentgateway-system
    spec:
      ai:
        provider:
          openai:
            model: meta-llama/Llama-3.1-8B-Instruct
          host: vllm.agentgateway-system.svc.cluster.local
          port: 8000
    EOF
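You can confirm the backend resource was created and inspect its status. This assumes the AgentgatewayBackend CRD registers the lowercase resource name agentgatewaybackend; adjust if your installation uses a different name.

```shell
# Show the backend resource, including any status conditions set by the controller.
kubectl get agentgatewaybackend vllm -n agentgateway-system -o yaml
```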

    Review the following table to understand this configuration. For more information, see the API reference.

      • ai.provider.openai: The OpenAI-compatible provider type. vLLM exposes an OpenAI-compatible API, so the openai type is used here.
      • openai.model: The model name as served by vLLM. This must match the --model argument used when starting vLLM.
      • host: The in-cluster DNS name of the Service that points to the vLLM instance.
      • port: The port vLLM listens on. The default is 8000.
  2. Create an HTTPRoute to expose the vLLM backend through the gateway.

    kubectl apply -f- <<EOF
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: vllm
      namespace: agentgateway-system
    spec:
      parentRefs:
      - name: agentgateway-proxy
        namespace: agentgateway-system
      rules:
      - backendRefs:
        - name: vllm
          namespace: agentgateway-system
          group: agentgateway.dev
          kind: AgentgatewayBackend
    EOF
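Gateway API populates route status, so you can check that the gateway accepted the route before sending traffic. The jsonpath below follows the standard HTTPRoute status layout and assumes a single parent gateway.

```shell
# Prints "True" when the gateway has accepted the route.
kubectl get httproute vllm -n agentgateway-system \
  -o jsonpath='{.status.parents[0].conditions[?(@.type=="Accepted")].status}'
```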
  3. Send a request to verify that agentgateway can route to vLLM. Use one of the following options, depending on how you expose the gateway.

    Option 1: If the gateway is exposed with an external address, save that address in the INGRESS_GW_ADDRESS environment variable and send the request to it.

    curl "$INGRESS_GW_ADDRESS" \
      -H "content-type: application/json" \
      -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [
          {
            "role": "user",
            "content": "Explain the benefits of vLLM for serving large language models."
          }
        ]
      }' | jq

    Option 2: If the gateway is not externally exposed, start a port-forward to the gateway in one terminal.

    kubectl port-forward -n agentgateway-system svc/agentgateway-proxy 8080:80

    In a second terminal, send the same request to the forwarded port.

    curl "localhost:8080" \
      -H "content-type: application/json" \
      -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [
          {
            "role": "user",
            "content": "Explain the benefits of vLLM for serving large language models."
          }
        ]
      }' | jq
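vLLM's OpenAI-compatible server also supports token streaming via the standard "stream" field. The sketch below assumes the port-forward from the previous step is still running; replace the address if your gateway is exposed externally.

```shell
# -N disables curl's output buffering so server-sent event chunks print as they arrive.
curl -N "localhost:8080" \
  -H "content-type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "stream": true,
    "messages": [
      {"role": "user", "content": "Summarize continuous batching in one sentence."}
    ]
  }'
```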

Troubleshooting

Connection refused or 503 response

What’s happening:

The gateway returns a 503 response or requests fail with a connection error.

Why it’s happening:

For external vLLM, the cluster cannot reach the server, for example because the EndpointSlice IP is wrong or firewall rules block traffic from the cluster nodes. For in-cluster vLLM, the pod may still be starting or may have failed to schedule.

How to fix it:

  1. For external vLLM, verify the server is reachable and the EndpointSlice is correct:

    curl http://<VLLM_SERVER_IP>:8000/v1/models
    kubectl get endpointslice vllm -n agentgateway-system -o yaml
  2. For in-cluster vLLM, check the pod status and logs:

    kubectl get pods -l app=vllm -n agentgateway-system
    kubectl logs deployment/vllm -n agentgateway-system
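If the pod looks healthy but the gateway still cannot reach it, test connectivity from inside the cluster. This optional sketch uses the public curlimages/curl image as a throwaway client.

```shell
# Hit the vLLM Service from a temporary in-cluster pod;
# a JSON model list confirms Service-level connectivity.
kubectl run curl-check --rm -it --restart=Never \
  --image=curlimages/curl -n agentgateway-system -- \
  curl -s http://vllm.agentgateway-system.svc.cluster.local:8000/v1/models
```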

Pod stuck in Pending state (in-cluster only)

What’s happening:

The vLLM pod does not start and shows a Pending status.

Why it’s happening:

No GPU nodes are available in the cluster, or the GPU resource requests cannot be satisfied.

How to fix it:

  1. Check GPU node availability:

    kubectl describe nodes | grep -A 5 "nvidia.com/gpu"
  2. Check the pod events for scheduling errors:

    kubectl describe pod -l app=vllm -n agentgateway-system
