Load balancing

Code examples on this page have been automatically tested and verified.

Distribute requests across multiple LLM providers automatically by using the Power of Two Choices (P2C) algorithm.

About load balancing

Load balancing distributes incoming requests across multiple backend LLM providers to optimize performance, cost, and availability. agentgateway uses an intelligent Power of Two Choices (P2C) algorithm with health-aware scoring to automatically select the best available provider for each request.

Unlike simple strategies like round-robin or random selection, the P2C algorithm makes smarter routing decisions by:

  1. Selecting two random providers from the available pool (the same provider can be selected twice, preventing starvation of the lowest-scored endpoint)
  2. Scoring each provider based on health, latency, and pending requests
  3. Routing to the provider with the better score

This approach provides superior performance compared to named strategies found in other AI gateways (such as “least-busy,” “least-latency,” or “cost-based” routing) without requiring you to manually select a strategy.

How P2C load balancing works

The load balancing algorithm uses several factors to score each provider:

  • Health score (EWMA): An exponentially-weighted moving average that tracks success rate. Each successful request records 1.0, each failure records 0.0. Recent results weigh more heavily (α = 0.3). Providers with recent failures receive lower scores.
  • Request latency (EWMA): Tracks average response time in seconds using the same EWMA calculation. Only successful requests contribute to latency tracking—failures are not measured to avoid skewing results with fast error responses.
  • Pending requests: Accounts for the number of active in-flight requests to avoid overloading busy providers. Each pending request adds a 10% penalty to the latency component.
  • Eviction: Temporarily removes providers that consistently fail or return rate limit errors, moving them to a rejected state until they can be retried.

The final score for each provider is calculated as:

score = health / (1 + latency_penalty)
where latency_penalty = request_latency * (1 + pending_requests * 0.1)

This scoring mechanism automatically adapts to changing conditions, routing traffic away from slow or failing providers without manual intervention.
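The scoring and selection logic described above can be sketched in Python. This is a simplified model, not the gateway's implementation: the EWMA constant (α = 0.3) and the 10% pending-request penalty come from the description above, while the class and function names are illustrative.

```python
import random

ALPHA = 0.3  # EWMA weight for the most recent observation

class Provider:
    def __init__(self, name):
        self.name = name
        self.health = 1.0   # EWMA of success (1.0) / failure (0.0)
        self.latency = 0.0  # EWMA of successful-request latency, in seconds
        self.pending = 0    # active in-flight requests

    def record(self, success, latency_s=None):
        # Health tracks every outcome; latency tracks successes only,
        # so fast error responses do not skew the latency estimate.
        self.health = ALPHA * (1.0 if success else 0.0) + (1 - ALPHA) * self.health
        if success and latency_s is not None:
            self.latency = ALPHA * latency_s + (1 - ALPHA) * self.latency

    def score(self):
        # score = health / (1 + latency_penalty), per the formula above.
        latency_penalty = self.latency * (1 + self.pending * 0.1)
        return self.health / (1 + latency_penalty)

def p2c_pick(providers):
    # Two independent random draws; the same provider may be drawn twice,
    # which prevents starvation of the lowest-scored endpoint.
    a, b = random.choice(providers), random.choice(providers)
    return a if a.score() >= b.score() else b
```

For example, a provider with a 0.15 s latency EWMA and full health scores 1 / 1.15 ≈ 0.87, while the same provider with 10 pending requests scores 1 / 1.3 ≈ 0.77, so new traffic drifts toward less busy endpoints.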

Load balancing within priority groups

When you configure multiple priority groups (for failover or traffic splitting), load balancing applies within each priority group. The gateway:

  1. Selects the highest-priority group with available providers
  2. Uses P2C algorithm to choose the best provider within that group
  3. Falls back to the next priority group if all providers in the current group are unavailable

This combines the benefits of automatic intelligent load balancing with explicit priority-based failover control.
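The group-selection order above can be sketched as follows. The dict shape, the `available` flag, and the function name are illustrative assumptions; in the gateway, availability is determined by the eviction state.

```python
import random

def select_provider(groups):
    # groups: priority groups ordered highest priority first; each group is a
    # list of dicts with an "available" flag and a precomputed "score".
    for group in groups:
        candidates = [p for p in group if p["available"]]
        if candidates:
            # Power of Two Choices within the selected group:
            # two independent draws, keep the better score.
            a, b = random.choice(candidates), random.choice(candidates)
            return a if a["score"] >= b["score"] else b
    raise RuntimeError("no available providers in any priority group")
```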

Before you begin

  1. Set up an agentgateway proxy.
  2. Set up API access to each LLM provider that you want to use.

Load balance across multiple providers

Create a backend with multiple providers in the same priority group to enable load balancing.

  1. Create an AgentgatewayBackend with multiple providers. In this example, requests are load balanced across OpenAI and Anthropic.

    kubectl apply -f- <<EOF
    apiVersion: agentgateway.dev/v1alpha1
    kind: AgentgatewayBackend
    metadata:
      name: loadbalanced-backend
      namespace: agentgateway-system
    spec:
      ai:
        groups:
          - providers:
              - name: openai-gpt4
                openai:
                  model: gpt-4o
                policies:
                  auth:
                    secretRef:
                      name: openai-secret
              - name: anthropic-claude
                anthropic:
                  model: claude-3-5-sonnet-latest
                policies:
                  auth:
                    secretRef:
                      name: anthropic-secret
    EOF
  2. Create an HTTPRoute to route traffic to the backend.

    kubectl apply -f- <<EOF
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: loadbalanced-route
      namespace: agentgateway-system
    spec:
      parentRefs:
        - name: agentgateway-proxy
          namespace: agentgateway-system
      rules:
        - matches:
            - path:
                type: PathPrefix
                value: /chat
          backendRefs:
            - name: loadbalanced-backend
              namespace: agentgateway-system
              group: agentgateway.dev
              kind: AgentgatewayBackend
    EOF
  3. Send multiple requests to observe load balancing behavior.

    for i in {1..10}; do
      curl "$INGRESS_GW_ADDRESS/chat" -H content-type:application/json -d '{
        "messages": [{"role": "user", "content": "Say hello"}]
      }' | jq -r '.model'
    done

    If you access the gateway through port-forwarding instead of an external address, replace $INGRESS_GW_ADDRESS with localhost:8080.

    You’ll see responses from both providers, with the P2C algorithm automatically selecting the best provider for each request based on current health and performance metrics.

Traffic splitting for A/B testing

You can use weighted backendRefs in HTTPRoute to split traffic for A/B testing or canary deployments. This is useful for comparing model performance or gradually rolling out a new model.

For a complete guide on traffic splitting patterns, see Traffic splitting.

  1. Create separate AgentgatewayBackend resources for the stable and canary models.

    kubectl apply -f- <<EOF
    apiVersion: agentgateway.dev/v1alpha1
    kind: AgentgatewayBackend
    metadata:
      name: stable-backend
      namespace: agentgateway-system
    spec:
      ai:
        groups:
          - providers:
              - name: stable-model
                openai:
                  model: gpt-4o
                policies:
                  auth:
                    secretRef:
                      name: openai-secret
    ---
    apiVersion: agentgateway.dev/v1alpha1
    kind: AgentgatewayBackend
    metadata:
      name: canary-backend
      namespace: agentgateway-system
    spec:
      ai:
        groups:
          - providers:
              - name: canary-model
                openai:
                  model: gpt-4o-mini
                policies:
                  auth:
                    secretRef:
                      name: openai-secret
    EOF
  2. Create an HTTPRoute with weighted backend references. This example routes 80% of traffic to the stable model and 20% to the canary model.

    kubectl apply -f- <<EOF
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: test-route
      namespace: agentgateway-system
    spec:
      parentRefs:
        - name: agentgateway-proxy
          namespace: agentgateway-system
      rules:
        - matches:
            - path:
                type: PathPrefix
                value: /test
          backendRefs:
            - name: stable-backend
              namespace: agentgateway-system
              group: agentgateway.dev
              kind: AgentgatewayBackend
              weight: 80
            - name: canary-backend
              namespace: agentgateway-system
              group: agentgateway.dev
              kind: AgentgatewayBackend
              weight: 20
    EOF

    Each backend can contain multiple providers that are load balanced using P2C within that backend, while the HTTPRoute distributes traffic between backends based on the configured weights.
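The 80/20 weighting can be sanity-checked with a quick simulation of weighted selection. This is illustrative only; the gateway implements weighted routing internally.

```python
import random

random.seed(42)  # deterministic for reproducibility

# HTTPRoute backendRefs weights from the example above.
weights = {"stable-backend": 80, "canary-backend": 20}
backends = list(weights)

# Draw 10,000 simulated requests according to the configured weights.
picks = random.choices(backends, weights=[weights[b] for b in backends], k=10_000)
stable_share = picks.count("stable-backend") / len(picks)
print(f"stable share: {stable_share:.2%}")  # close to 80%
```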

Known limitations

⚠️ Rate-limit-based eviction only: Provider eviction and failover currently trigger only on 429 (Too Many Requests) responses with proper rate-limit headers (Retry-After or x-ratelimit-reset). Eviction does NOT trigger on:

  • 503 Service Unavailable responses
  • Connection refused or timeout errors
  • DNS resolution failures
  • Other error codes (404, 500, etc.)

Providers that return non-429 errors receive degraded health scores (EWMA) and lower priority within their group, but are not evicted or failed over. This means traffic may still be routed to consistently failing providers, though at reduced rates.
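A simplified model of this decision, based solely on the behavior described above (the function name and return values are illustrative, not the gateway's internal API):

```python
def classify_failure(status_code, headers):
    """Decide whether a failed response triggers eviction or only
    degrades the provider's health score."""
    rate_limit_headers = {"retry-after", "x-ratelimit-reset"}
    has_rl_header = any(h.lower() in rate_limit_headers for h in headers)
    if status_code == 429 and has_rl_header:
        return "evict"    # temporary removal from the candidate pool
    # 503s, timeouts, and other errors lower the health EWMA,
    # but the provider remains routable.
    return "degrade"
```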

Monitoring load balancing

Use observability features to monitor load balancing behavior:

  • Metrics: Track request counts and latencies per provider
  • Traces: View which provider handled each request
  • Health scores: Monitor provider health and eviction events

The gateway automatically exports OpenTelemetry metrics that include provider selection information, allowing you to verify that load balancing is working as expected.

Cleanup

You can remove the resources that you created in this guide.

    kubectl delete AgentgatewayBackend loadbalanced-backend -n agentgateway-system
    kubectl delete httproute loadbalanced-route -n agentgateway-system
    kubectl delete AgentgatewayBackend stable-backend canary-backend -n agentgateway-system
    kubectl delete httproute test-route -n agentgateway-system

Next steps

  • Configure failover with priority groups for high availability
  • Set up cost tracking to monitor spending across providers
  • Use budget limits to control costs per provider or user