Get started

This guide walks you through installing NetFoundry LLM Gateway and running your first requests. By the end, you'll have the gateway proxying requests to one or more inference providers.

Installation

Choose the installation method that fits your environment.

Pre-built binaries

Pre-built binaries are available for Linux, macOS, and Windows:

Visit the GitHub Releases page.
Download the binary for your platform.
Make it executable:
```
chmod +x llm-gateway
```
Run it:
```
./llm-gateway run config.yaml
```

Install with Go

If you have Go 1.22 or later:

go install github.com/openziti/llm-gateway/cmd/llm-gateway@latest
llm-gateway run config.yaml

Build from source

Clone the repository and build the binary locally:

git clone https://github.com/openziti/llm-gateway.git
cd llm-gateway
go build -o llm-gateway ./cmd/llm-gateway
./llm-gateway run config.yaml

Examples

The examples below progress from a simple single-provider proxy to a full production configuration.

Proxy a local inference server

Start Ollama:
```
ollama serve
```

Create config.yaml:

local:
  base_url: http://localhost:11434

Start the gateway:
```
llm-gateway run config.yaml
```

Send a request:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.7
  }'

The gateway listens on http://localhost:8080 by default.

Route between OpenAI and Anthropic

The gateway routes requests to the correct provider by prefix-matching on the model name:

providers:
  open_ai:
    api_key: ${OPENAI_API_KEY}

  anthropic:
    api_key: ${ANTHROPIC_API_KEY}

  local:
    base_url: http://localhost:11434

Requests are routed automatically based on the model prefix: gpt-* goes to OpenAI, claude-* to Anthropic, everything else to the local provider:

# Routes to OpenAI
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4", "messages": [{"role": "user", "content": "Hello from OpenAI"}]}'

# Routes to Anthropic
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "claude-3-sonnet-20240229", "messages": [{"role": "user", "content": "Hello from Anthropic"}]}'

# Routes to local Ollama
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama2", "messages": [{"role": "user", "content": "Hello from Ollama"}]}'

Restrict API access with virtual keys

Generate an API key:

llm-gateway genkey
# sk-gw-a1b2c3d4e5f6...

Add api_keys to your config, referencing the key and setting per-key model permissions:

api_keys:
  enabled: true
  keys:
    - name: primary
      key: ${PRIMARY_API_KEY}
      allowed_models: ["gpt-*", "claude-*"]
    - name: local-only
      key: ${LOCAL_API_KEY}
      allowed_models: ["llama*"]

providers:
  open_ai:
    api_key: ${OPENAI_API_KEY}

  anthropic:
    api_key: ${ANTHROPIC_API_KEY}

  local:
    base_url: http://localhost:11434

Clients send their key in the Authorization header:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-gw-a1b2c3d4e5f6..." \
  -d '{"model": "gpt-4", "messages": [{"role": "user", "content": "Hello"}]}'

Use the Python OpenAI client

The gateway works as a drop-in replacement for the OpenAI Python client. Point base_url at the gateway and it handles provider routing transparently:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"  # gateway handles auth
)

# Routes to OpenAI
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Routes to Anthropic (translated automatically)
response = client.chat.completions.create(
    model="claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Routes to local backend (Ollama, vLLM, etc.)
response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}]
)

Semantic routing

Route requests automatically based on content analysis, without requiring clients to specify a model:

routing:
  default_route: general
  semantic:
    enabled: true
    provider: local
    model: nomic-embed-text
    threshold: 0.75
    ambiguous_threshold: 0.5
  routes:
    - name: coding
      model: claude-haiku-4-5-20251001
      description: "code generation, debugging, and technical tasks"
      examples:
        - "write a python function to sort a list"
        - "debug this segfault in my C code"
    - name: general
      model: qwen3-vl:30b
      description: "general knowledge and conversation"
      examples:
        - "what is the capital of France"
        - "explain how photosynthesis works"

providers:
  local:
    base_url: http://localhost:11434

See Semantic routing for a full explanation of how routing works.

Multi-endpoint load balancing

Distribute requests across multiple inference backends:

providers:
  local:
    endpoints:
      - name: ollama-primary
        base_url: http://localhost:11434
        weight: 2
      - name: ollama-secondary
        base_url: http://localhost:11435
        weight: 1
      - name: vllm-endpoint
        base_url: http://vllm.example.com:8000
        weight: 1

  open_ai:
    api_key: ${OPENAI_API_KEY}

  anthropic:
    api_key: ${ANTHROPIC_API_KEY}

See Multi-endpoint load balancing for health check and failover options.

Connect via zrok

Share the gateway over a zrok overlay so clients can reach it without a public IP:

zrok:
  share:
    enabled: true
    mode: private

providers:
  local:
    base_url: http://localhost:11434

Or access a remote inference backend through a zrok share:

providers:
  local:
    endpoints:
      - name: remote-ollama
        zrok_share_token: ${ZROK_OLLAMA_TOKEN}

See Connect via zrok for setup details.

Production configuration

A full configuration combining multiple providers, API key authentication, semantic routing, load balancing, metrics, and zrok:

listen: "0.0.0.0:8080"

zrok:
  share:
    enabled: true
    token: ${ZROK_SHARE_TOKEN}

api_keys:
  enabled: true
  keys:
    - name: primary
      key: ${PRIMARY_API_KEY}
      allowed_models: ["gpt-*", "claude-*", "llama*"]
    - name: local-only
      key: ${LOCAL_API_KEY}
      allowed_models: ["llama*"]

providers:
  open_ai:
    api_key: ${OPENAI_API_KEY}

  anthropic:
    api_key: ${ANTHROPIC_API_KEY}

  local:
    endpoints:
      - name: ollama-primary
        base_url: http://localhost:11434
        weight: 3
      - name: ollama-secondary
        base_url: http://localhost:11435
        weight: 1
      - name: vllm-endpoint
        zrok_share_token: ${ZROK_VLLM_TOKEN}
        weight: 2

routing:
  default_route: general
  semantic:
    enabled: true
    provider: local
    model: nomic-embed-text
    threshold: 0.75
    ambiguous_threshold: 0.5
  routes:
    - name: coding
      model: claude-haiku-4-5-20251001
      description: "code generation, debugging, and technical tasks"
      examples:
        - "write a python function to sort a list"
    - name: general
      model: qwen3-vl:30b
      description: "general knowledge and conversation"
      examples:
        - "what is the capital of France"

metrics:
  enabled: true

More info

Configuration: All configuration options
Providers: How provider routing and format translation work
Multi-endpoint load balancing: Advanced load balancing strategies
Virtual API keys: Client authentication and model-level restrictions
Semantic routing: Intelligent request routing based on content
Metrics: Prometheus metrics and observability

Installation​

Pre-built binaries​

Install with Go​

Build from source​

Examples​

Proxy a local inference server​

Route between OpenAI and Anthropic​

Restrict API access with virtual keys​

Use the Python OpenAI client​

Semantic routing​

Multi-endpoint load balancing​

Connect via zrok​

Production configuration​

More info​