Load balancing
Distribute requests across multiple LLM providers automatically by using the Power of Two Choices (P2C) algorithm.
About load balancing
Load balancing distributes incoming requests across multiple backend LLM providers to optimize performance, cost, and availability. agentgateway uses an intelligent Power of Two Choices (P2C) algorithm with health-aware scoring to automatically select the best available provider for each request.
Unlike simple strategies like round-robin or random selection, the P2C algorithm makes smarter routing decisions by:
- Selecting two random providers from the available pool (the same provider can be selected twice, preventing starvation of the lowest-scored endpoint)
- Scoring each provider based on health, latency, and pending requests
- Routing to the provider with the better score
This approach provides superior performance compared to named strategies found in other AI gateways (such as “least-busy,” “least-latency,” or “cost-based” routing) without requiring you to manually select a strategy.
How P2C load balancing works
The load balancing algorithm uses several factors to score each provider:
- Health score (EWMA): An exponentially-weighted moving average that tracks success rate. Each successful request records 1.0, each failure records 0.0. Recent results weigh more heavily (α = 0.3). Providers with recent failures receive lower scores.
- Request latency (EWMA): Tracks average response time in seconds using the same EWMA calculation. Only successful requests contribute to latency tracking—failures are not measured to avoid skewing results with fast error responses.
- Pending requests: Accounts for the number of active in-flight requests to avoid overloading busy providers. Each pending request adds a 10% penalty to the latency component.
- Eviction: Temporarily removes providers that consistently fail or return rate limit errors, moving them to a rejected state until they can be retried.
The final score for each provider is calculated as:
```
score = health / (1 + latency_penalty)
latency_penalty = request_latency * (1 + pending_requests * 0.1)
```

This scoring mechanism automatically adapts to changing conditions, routing traffic away from slow or failing providers without manual intervention.
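The scoring factors can be combined in a small sketch. The EWMA update (α = 0.3), the success-only latency tracking, and the 10% per-pending-request penalty follow the formulas in this section; the class and field names are illustrative, not agentgateway's actual types.

```python
ALPHA = 0.3  # EWMA weight for the most recent observation

class ProviderStats:
    def __init__(self):
        self.health = 1.0   # EWMA of success (1.0) / failure (0.0)
        self.latency = 0.0  # EWMA of response time in seconds
        self.pending = 0    # in-flight requests

    def record(self, success, latency_s=None):
        self.health = ALPHA * (1.0 if success else 0.0) + (1 - ALPHA) * self.health
        if success and latency_s is not None:
            # Only successful requests update latency, so fast error
            # responses do not skew the average downward.
            self.latency = ALPHA * latency_s + (1 - ALPHA) * self.latency

    def score(self):
        latency_penalty = self.latency * (1 + self.pending * 0.1)
        return self.health / (1 + latency_penalty)

stats = ProviderStats()
stats.record(success=True, latency_s=0.5)
stats.record(success=False)  # failure lowers health, leaves latency alone
```

After one success at 0.5 s and one failure, the health EWMA drops to 0.7 while the latency EWMA stays at 0.15 s, so the score falls and the provider becomes less likely to win P2C comparisons.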
Load balancing within priority groups
When you configure multiple priority groups (for failover or traffic splitting), load balancing applies within each priority group. The gateway:
- Selects the highest-priority group with available providers
- Uses P2C algorithm to choose the best provider within that group
- Falls back to the next priority group if all providers in the current group are unavailable
This combines the benefits of automatic intelligent load balancing with explicit priority-based failover control.
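The group-selection logic can be sketched roughly as follows; `available` and `p2c_select` are hypothetical helpers standing in for the gateway's health tracking and the P2C step described earlier.

```python
def pick_provider(groups, available, p2c_select):
    """Walk priority groups in order; load-balance with P2C inside
    the first group that has any available provider."""
    for group in groups:  # groups are ordered highest priority first
        candidates = [p for p in group if available(p)]
        if candidates:
            return p2c_select(candidates)
    raise RuntimeError("no available providers in any priority group")

groups = [["primary-a", "primary-b"], ["fallback"]]
# Simulate the primary group being fully unavailable: the gateway
# falls through to the lower-priority group.
choice = pick_provider(groups,
                       available=lambda p: p == "fallback",
                       p2c_select=lambda c: c[0])
```

Here `choice` resolves to `"fallback"`: the primary group yields no candidates, so the second group is used.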
Before you begin
- Set up an agentgateway proxy.
- Set up API access to each LLM provider that you want to use.
Load balance across multiple providers
Create a backend with multiple providers in the same priority group to enable load balancing.
Create an AgentgatewayBackend with multiple providers. In this example, requests are load balanced across OpenAI and Anthropic.
```sh
kubectl apply -f- <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayBackend
metadata:
  name: loadbalanced-backend
  namespace: agentgateway-system
spec:
  ai:
    groups:
    - providers:
      - name: openai-gpt4
        openai:
          model: gpt-4o
        policies:
          auth:
            secretRef:
              name: openai-secret
      - name: anthropic-claude
        anthropic:
          model: claude-3-5-sonnet-latest
        policies:
          auth:
            secretRef:
              name: anthropic-secret
EOF
```
Create an HTTPRoute to route traffic to the backend.
```sh
kubectl apply -f- <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: loadbalanced-route
  namespace: agentgateway-system
spec:
  parentRefs:
  - name: agentgateway-proxy
    namespace: agentgateway-system
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /chat
    backendRefs:
    - name: loadbalanced-backend
      namespace: agentgateway-system
      group: agentgateway.dev
      kind: AgentgatewayBackend
EOF
```
Send multiple requests to observe load balancing behavior.
```sh
for i in {1..10}; do
  curl "$INGRESS_GW_ADDRESS/chat" -H content-type:application/json -d '{
    "messages": [{"role": "user", "content": "Say hello"}]
  }' | jq -r '.model'
done
```

You'll see responses from both providers, with the P2C algorithm automatically selecting the best provider for each request based on current health and performance metrics.
Traffic splitting for A/B testing
You can use weighted backendRefs in HTTPRoute to split traffic for A/B testing or canary deployments. This is useful for comparing model performance or gradually rolling out a new model.
For a complete guide on traffic splitting patterns, see Traffic splitting.
Create separate AgentgatewayBackend resources for the stable and canary models.
```sh
kubectl apply -f- <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayBackend
metadata:
  name: stable-backend
  namespace: agentgateway-system
spec:
  ai:
    groups:
    - providers:
      - name: stable-model
        openai:
          model: gpt-4o
        policies:
          auth:
            secretRef:
              name: openai-secret
---
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayBackend
metadata:
  name: canary-backend
  namespace: agentgateway-system
spec:
  ai:
    groups:
    - providers:
      - name: canary-model
        openai:
          model: gpt-4o-mini
        policies:
          auth:
            secretRef:
              name: openai-secret
EOF
```

Create an HTTPRoute with weighted backend references. This example routes 80% of traffic to the stable model and 20% to the canary model.
```sh
kubectl apply -f- <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: test-route
  namespace: agentgateway-system
spec:
  parentRefs:
  - name: agentgateway-proxy
    namespace: agentgateway-system
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /test
    backendRefs:
    - name: stable-backend
      namespace: agentgateway-system
      group: agentgateway.dev
      kind: AgentgatewayBackend
      weight: 80
    - name: canary-backend
      namespace: agentgateway-system
      group: agentgateway.dev
      kind: AgentgatewayBackend
      weight: 20
EOF
```

Each backend can contain multiple providers that are load balanced using P2C within that backend, while the HTTPRoute distributes traffic between backends based on the configured weights.
Known limitations
Rate-limit-based eviction only: Provider eviction and failover currently trigger only on 429 (Too Many Requests) responses that include rate-limit headers (Retry-After or x-ratelimit-reset). Eviction does NOT trigger on:
- 503 Service Unavailable responses
- Connection refused or timeout errors
- DNS resolution failures
- Other error codes (404, 500, etc.)
Providers that return non-429 errors receive degraded health scores (EWMA) and lower priority within their group, but are not evicted or failed over. This means traffic may still be routed to consistently failing providers, though at reduced rates.
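As a rough sketch of this eviction rule (illustrative only; the header names are taken from the list above, and the function is not part of agentgateway's API):

```python
def should_evict(status_code, headers):
    """Eviction triggers only on 429 responses that carry a
    rate-limit header; all other failures merely degrade the
    provider's EWMA health score within its group."""
    if status_code != 429:
        return False
    headers = {k.lower() for k in headers}
    return "retry-after" in headers or "x-ratelimit-reset" in headers
```

So a provider returning a steady stream of 503s is deprioritized by its falling health score but never evicted, while a 429 with a Retry-After header moves it to the rejected state.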
Monitoring load balancing
Use observability features to monitor load balancing behavior:
- Metrics: Track request counts and latencies per provider
- Traces: View which provider handled each request
- Health scores: Monitor provider health and eviction events
The gateway automatically exports OpenTelemetry metrics that include provider selection information, allowing you to verify that load balancing is working as expected.
Cleanup
You can remove the resources that you created in this guide.

```sh
kubectl delete AgentgatewayBackend loadbalanced-backend -n agentgateway-system
kubectl delete httproute loadbalanced-route -n agentgateway-system
```

Next steps
- Configure failover with priority groups for high availability
- Set up cost tracking to monitor spending across providers
- Use budget limits to control costs per provider or user