Ollama
Configure Ollama to serve local models through agentgateway. Ollama runs on a machine outside your cluster, and agentgateway routes requests to it over the network.
Before you begin
- Install and set up an agentgateway proxy.
- Install and run Ollama on a machine accessible from your Kubernetes cluster.
- Get the IP address of the machine running Ollama.
Set up Ollama
On the machine where you installed Ollama, make sure that you have at least one model pulled.
```sh
ollama list
```

If not, pull a model.
```sh
ollama pull llama3.2
```

Configure Ollama to accept external connections. By default, Ollama only listens on `localhost`. You can change this setting with the `OLLAMA_HOST` environment variable.

```sh
export OLLAMA_HOST=0.0.0.0:11434
```

⚠️ Binding Ollama to `0.0.0.0` exposes it on all network interfaces. Use firewall rules to restrict access to your Kubernetes cluster nodes only.

Restart Ollama to apply the new setting.
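Note that the `export` above applies only to the current shell session. On Linux installs that register Ollama as a systemd service (as the official install script does), you can make the setting persistent in the service unit instead. This is a sketch, assuming the service is named `ollama`:

```sh
# Open a drop-in override for the ollama service and add the
# environment variable under the [Service] section:
sudo systemctl edit ollama
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0:11434"

# Reload unit files and restart Ollama so the change takes effect.
sudo systemctl daemon-reload
sudo systemctl restart ollama
```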
Verify Ollama is accessible from the machine’s network address.
```sh
curl http://<OLLAMA_IP>:11434/v1/models
```
Configure agentgateway to reach Ollama
Because Ollama runs outside your Kubernetes cluster, you need a headless Service and EndpointSlice to give it a stable in-cluster DNS name.
Get the IP address of the machine running Ollama.
```sh
# macOS
ipconfig getifaddr en0

# Linux
hostname -I | awk '{print $1}'
```

Create a headless Service and EndpointSlice that point to the external Ollama instance. Replace `<OLLAMA_IP>` with the actual IP address.

```yaml
kubectl apply -f- <<EOF
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: agentgateway-system
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - port: 11434
    targetPort: 11434
    protocol: TCP
---
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: ollama
  namespace: agentgateway-system
  labels:
    kubernetes.io/service-name: ollama
addressType: IPv4
endpoints:
- addresses:
  - <OLLAMA_IP>
ports:
- port: 11434
  protocol: TCP
EOF
```

Create an AgentgatewayBackend resource. The `openai` provider type is used because Ollama exposes an OpenAI-compatible API. The `host` and `port` fields point to the headless Service DNS name.

```yaml
kubectl apply -f- <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayBackend
metadata:
  name: ollama
  namespace: agentgateway-system
spec:
  ai:
    provider:
      openai:
        model: llama3.2
        host: ollama.agentgateway-system.svc.cluster.local
        port: 11434
EOF
```

Review the following table to understand this configuration. For more information, see the API reference.
| Setting | Description |
|---------|-------------|
| `ai.provider.openai` | The OpenAI-compatible provider type. Ollama exposes an OpenAI-compatible API, so the `openai` type is used here. |
| `openai.model` | The Ollama model to use. This must match a model you pulled with `ollama pull`. |
| `host` | The in-cluster DNS name of the headless Service pointing to the external Ollama instance. |
| `port` | The port Ollama listens on. The default is `11434`. |

Create an HTTPRoute to expose the Ollama backend through the gateway.
```yaml
kubectl apply -f- <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: ollama
  namespace: agentgateway-system
spec:
  parentRefs:
  - name: agentgateway-proxy
    namespace: agentgateway-system
  rules:
  - backendRefs:
    - name: ollama
      namespace: agentgateway-system
      group: agentgateway.dev
      kind: AgentgatewayBackend
EOF
```
Send a request to verify the setup.
If the gateway is exposed with an external address, send the request directly. This example assumes the address is stored in the `INGRESS_GW_ADDRESS` environment variable.

```sh
curl "$INGRESS_GW_ADDRESS" \
  -H "content-type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {
        "role": "user",
        "content": "Explain the benefits of running models locally."
      }
    ]
  }' | jq
```

Alternatively, use a port-forward. In one terminal, start a port-forward to the gateway:

```sh
kubectl port-forward -n agentgateway-system svc/agentgateway-proxy 8080:80
```

In a second terminal, send a request:

```sh
curl "localhost:8080" \
  -H "content-type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {
        "role": "user",
        "content": "Explain the benefits of running models locally."
      }
    ]
  }' | jq
```

Example output:
```json
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1727967462,
  "model": "llama3.2",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Running models locally provides complete data privacy, no API costs or rate limits, and consistent low latency without network dependencies."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 32,
    "total_tokens": 47
  }
}
```
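When scripting against the gateway, you can narrow the `jq` filter to print only the model's reply. The sketch below inlines a shortened copy of the sample response so you can try the filter without a live gateway; in practice, pipe the `curl` output from the previous step into `jq` instead.

```sh
# Shortened sample response body, inlined for illustration only.
response='{"choices":[{"index":0,"message":{"role":"assistant","content":"Running models locally provides complete data privacy."},"finish_reason":"stop"}]}'

# Print only the assistant's reply text. The -r flag strips the
# surrounding JSON quotes from the string.
echo "$response" | jq -r '.choices[0].message.content'
# → Running models locally provides complete data privacy.
```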
Troubleshooting
Connection refused or 503 response
What’s happening:
Requests fail with a connection error or the gateway returns a 503 response.
Why it’s happening:
The Kubernetes cluster cannot reach the Ollama instance. This is usually caused by an incorrect IP in the EndpointSlice, a firewall blocking port 11434, or Ollama not configured to accept external connections.
How to fix it:
Verify Ollama is reachable from the machine’s network address:
```sh
curl http://<OLLAMA_IP>:11434/v1/models
```

Check that the EndpointSlice contains the correct IP:

```sh
kubectl get endpointslice ollama -n agentgateway-system -o yaml
```

Test connectivity from inside the cluster:
```sh
kubectl run -it --rm debug --image=curlimages/curl --restart=Never \
  -- curl http://ollama.agentgateway-system.svc.cluster.local:11434/v1/models
```
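If the in-cluster curl fails, it can also help to check whether the headless Service name resolves at all. This is a sketch using a throwaway busybox pod; any image that ships `nslookup` works:

```sh
# Resolve the headless Service name from inside the cluster.
# The answer should contain the <OLLAMA_IP> you set in the
# EndpointSlice; if it does not resolve, check the
# kubernetes.io/service-name label on the EndpointSlice.
kubectl run -it --rm dns-debug --image=busybox --restart=Never \
  -- nslookup ollama.agentgateway-system.svc.cluster.local
```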
Model not found
What’s happening:
The request returns an error indicating the model is not available.
Why it’s happening:
The model specified in the request or the AgentgatewayBackend resource has not been pulled in Ollama.
How to fix it:
List models available in Ollama:
ollama listPull the model if it is missing:
ollama pull llama3.2