Load balancing

Code examples on this page have been automatically tested and verified.

Distribute requests across multiple LLM providers automatically by using the Power of Two Choices (P2C) algorithm.

About load balancing

Load balancing distributes incoming requests across multiple backend LLM providers to optimize performance, cost, and availability. agentgateway uses an intelligent Power of Two Choices (P2C) algorithm with health-aware scoring to automatically select the best available provider for each request.

Unlike simple strategies like round-robin or random selection, the P2C algorithm makes smarter routing decisions by:

  1. Selecting two random providers from the available pool (the same provider can be selected twice, preventing starvation of the lowest-scored endpoint)
  2. Scoring each provider based on health, latency, and pending requests
  3. Routing to the provider with the better score

This approach provides superior performance compared to named strategies found in other AI gateways (such as “least-busy,” “least-latency,” or “cost-based” routing) without requiring you to manually select a strategy.

How P2C load balancing works

The load balancing algorithm uses several factors to score each provider:

  • Health score (EWMA): An exponentially-weighted moving average that tracks success rate. Each successful request records 1.0, each failure records 0.0. Recent results weigh more heavily (α = 0.3). Providers with recent failures receive lower scores.
  • Request latency (EWMA): Tracks average response time in seconds using the same EWMA calculation. Only successful requests contribute to latency tracking—failures are not measured to avoid skewing results with fast error responses.
  • Pending requests: Accounts for the number of active in-flight requests to avoid overloading busy providers. Each pending request adds a 10% penalty to the latency component.
  • Eviction: Temporarily removes providers that consistently fail or return rate limit errors, moving them to a rejected state until they can be retried.

The final score for each provider is calculated as:

score = health / (1 + latency_penalty)
where latency_penalty = request_latency * (1 + pending_requests * 0.1)

This scoring mechanism automatically adapts to changing conditions, routing traffic away from slow or failing providers without manual intervention.
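The scoring and selection logic described above can be sketched in Python. This is a simplified model, not the gateway's implementation: the EWMA constant (α = 0.3) and the 10% pending-request penalty come from the description above, while the class and function names are illustrative.

```python
import random

ALPHA = 0.3  # EWMA weight for the most recent observation

class Provider:
    def __init__(self, name):
        self.name = name
        self.health = 1.0   # EWMA of success (1.0) / failure (0.0)
        self.latency = 0.0  # EWMA of successful-request latency, in seconds
        self.pending = 0    # active in-flight requests

    def record(self, success, latency_s=None):
        # Health tracks every outcome; latency tracks successes only,
        # so fast error responses do not skew the latency estimate.
        self.health = ALPHA * (1.0 if success else 0.0) + (1 - ALPHA) * self.health
        if success and latency_s is not None:
            self.latency = ALPHA * latency_s + (1 - ALPHA) * self.latency

    def score(self):
        # score = health / (1 + latency_penalty), per the formula above.
        latency_penalty = self.latency * (1 + self.pending * 0.1)
        return self.health / (1 + latency_penalty)

def p2c_pick(providers):
    # Two independent random draws; the same provider may be drawn twice,
    # which prevents starvation of the lowest-scored endpoint.
    a, b = random.choice(providers), random.choice(providers)
    return a if a.score() >= b.score() else b
```

For example, a provider with a 0.15 s latency EWMA and full health scores 1 / 1.15 ≈ 0.87, while the same provider with 10 pending requests scores 1 / 1.3 ≈ 0.77, so new traffic drifts toward less busy endpoints.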

Load balancing within priority groups

When you configure multiple priority groups (for failover or traffic splitting), load balancing applies within each priority group. The gateway:

  1. Selects the highest-priority group with available providers
  2. Uses P2C algorithm to choose the best provider within that group
  3. Falls back to the next priority group if all providers in the current group are unavailable

This combines the benefits of automatic intelligent load balancing with explicit priority-based failover control.
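The group-selection order above can be sketched as follows. The dict shape, the `available` flag, and the function name are illustrative assumptions; in the gateway, availability is determined by the eviction state.

```python
import random

def select_provider(groups):
    # groups: priority groups ordered highest priority first; each group is a
    # list of dicts with an "available" flag and a precomputed "score".
    for group in groups:
        candidates = [p for p in group if p["available"]]
        if candidates:
            # Power of Two Choices within the selected group:
            # two independent draws, keep the better score.
            a, b = random.choice(candidates), random.choice(candidates)
            return a if a["score"] >= b["score"] else b
    raise RuntimeError("no available providers in any priority group")
```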

Before you begin

  1. Set up an agentgateway proxy.
  2. Set up API access to each LLM provider that you want to use.

Load balance across multiple providers

Create a backend with multiple providers in the same priority group to enable load balancing.

  1. Create an AgentgatewayBackend with multiple providers. In this example, requests are load balanced across OpenAI and Anthropic.

    kubectl apply -f- <<EOF
    apiVersion: agentgateway.dev/v1alpha1
    kind: AgentgatewayBackend
    metadata:
      name: loadbalanced-backend
      namespace: agentgateway-system
    spec:
      ai:
        groups:
          - providers:
              - name: openai-gpt4
                openai:
                  model: gpt-4o
                policies:
                  auth:
                    secretRef:
                      name: openai-secret
              - name: anthropic-claude
                anthropic:
                  model: claude-3-5-sonnet-latest
                policies:
                  auth:
                    secretRef:
                      name: anthropic-secret
    EOF
  2. Create an HTTPRoute to route traffic to the backend.

    kubectl apply -f- <<EOF
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: loadbalanced-route
      namespace: agentgateway-system
    spec:
      parentRefs:
        - name: agentgateway-proxy
          namespace: agentgateway-system
      rules:
        - matches:
            - path:
                type: PathPrefix
                value: /chat
          backendRefs:
            - name: loadbalanced-backend
              namespace: agentgateway-system
              group: agentgateway.dev
              kind: AgentgatewayBackend
    EOF
  3. Send multiple requests to observe load balancing behavior.

    for i in {1..10}; do
      curl "$INGRESS_GW_ADDRESS/chat" -H content-type:application/json -d '{
        "messages": [{"role": "user", "content": "Say hello"}]
      }' | jq -r '.model'
    done

    If you access the gateway through port-forwarding instead of an external address, replace $INGRESS_GW_ADDRESS with localhost:8080.

    You’ll see responses from both providers, with the P2C algorithm automatically selecting the best provider for each request based on current health and performance metrics.

Traffic splitting for A/B testing

You can use weighted backendRefs in HTTPRoute to split traffic for A/B testing or canary deployments. This is useful for comparing model performance or gradually rolling out a new model.

For a complete guide on traffic splitting patterns, see Traffic splitting.

  1. Create separate AgentgatewayBackend resources for the stable and canary models.

    kubectl apply -f- <<EOF
    apiVersion: agentgateway.dev/v1alpha1
    kind: AgentgatewayBackend
    metadata:
      name: stable-backend
      namespace: agentgateway-system
    spec:
      ai:
        groups:
          - providers:
              - name: stable-model
                openai:
                  model: gpt-4o
                policies:
                  auth:
                    secretRef:
                      name: openai-secret
    ---
    apiVersion: agentgateway.dev/v1alpha1
    kind: AgentgatewayBackend
    metadata:
      name: canary-backend
      namespace: agentgateway-system
    spec:
      ai:
        groups:
          - providers:
              - name: canary-model
                openai:
                  model: gpt-4o-mini
                policies:
                  auth:
                    secretRef:
                      name: openai-secret
    EOF
  2. Create an HTTPRoute with weighted backend references. This example routes 80% of traffic to the stable model and 20% to the canary model.

    kubectl apply -f- <<EOF
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: test-route
      namespace: agentgateway-system
    spec:
      parentRefs:
        - name: agentgateway-proxy
          namespace: agentgateway-system
      rules:
        - matches:
            - path:
                type: PathPrefix
                value: /test
          backendRefs:
            - name: stable-backend
              namespace: agentgateway-system
              group: agentgateway.dev
              kind: AgentgatewayBackend
              weight: 80
            - name: canary-backend
              namespace: agentgateway-system
              group: agentgateway.dev
              kind: AgentgatewayBackend
              weight: 20
    EOF

    Each backend can contain multiple providers that are load balanced using P2C within that backend, while the HTTPRoute distributes traffic between backends based on the configured weights.
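The 80/20 weighting can be sanity-checked with a quick simulation of weighted selection. This is illustrative only; the gateway implements weighted routing internally.

```python
import random

random.seed(42)  # deterministic for reproducibility

# HTTPRoute backendRefs weights from the example above.
weights = {"stable-backend": 80, "canary-backend": 20}
backends = list(weights)

# Draw 10,000 simulated requests according to the configured weights.
picks = random.choices(backends, weights=[weights[b] for b in backends], k=10_000)
stable_share = picks.count("stable-backend") / len(picks)
print(f"stable share: {stable_share:.2%}")  # close to 80%
```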

Known limitations

⚠️ Rate-limit-based eviction only: Provider eviction and failover currently trigger only on 429 (Too Many Requests) responses with proper rate-limit headers (Retry-After or x-ratelimit-reset). Eviction does NOT trigger on:

  • 503 Service Unavailable responses
  • Connection refused or timeout errors
  • DNS resolution failures
  • Other error codes (404, 500, etc.)

Providers that return non-429 errors receive degraded health scores (EWMA) and lower priority within their group, but are not evicted or failed over. This means traffic may still be routed to consistently failing providers, though at reduced rates.
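A simplified model of this decision, based solely on the behavior described above (the function name and return values are illustrative, not the gateway's internal API):

```python
def classify_failure(status_code, headers):
    """Decide whether a failed response triggers eviction or only
    degrades the provider's health score."""
    rate_limit_headers = {"retry-after", "x-ratelimit-reset"}
    has_rl_header = any(h.lower() in rate_limit_headers for h in headers)
    if status_code == 429 and has_rl_header:
        return "evict"    # temporary removal from the candidate pool
    # 503s, timeouts, and other errors lower the health EWMA,
    # but the provider remains routable.
    return "degrade"
```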

Monitoring load balancing

Use observability features to monitor load balancing behavior:

  • Metrics: Track request counts and latencies per provider
  • Traces: View which provider handled each request
  • Health scores: Monitor provider health and eviction events

The gateway automatically exports OpenTelemetry metrics that include provider selection information, allowing you to verify that load balancing is working as expected.

Cleanup

You can remove the resources that you created in this guide.

    kubectl delete AgentgatewayBackend loadbalanced-backend -n agentgateway-system
    kubectl delete httproute loadbalanced-route -n agentgateway-system
    kubectl delete AgentgatewayBackend stable-backend canary-backend -n agentgateway-system
    kubectl delete httproute test-route -n agentgateway-system

Next steps

  • Configure failover with priority groups for high availability
  • Set up cost tracking to monitor spending across providers
  • Use budget limits to control costs per provider or user