LLM Relay Server

An OpenAI-compatible API relay server with intelligent context management for long coding sessions. Works with any LLM provider (Together AI, OpenAI, Anthropic, etc.) and any model. Prevents context overflow errors while providing middleware capabilities like authentication, logging, and rate limiting.

Default Configuration: This relay ships with MiniMaxAI/MiniMax-M2.5 via Together AI as the default model, but can be configured to use any model by changing one line of code.

Why This Relay?

The Problem

Modern AI coding assistants (like OpenCode) maintain long conversation histories. When these exceed model context limits (~100k-200k tokens depending on model), one of three things happen:

Hard failure - Request rejected with "context too long" error
Silent truncation - Provider drops old messages without warning
Expensive context loss - Every request reprocesses full history from scratch

The Solution

This relay acts as a model-agnostic middleware layer between your AI client and any LLM provider, providing:

Request Management:

Authentication handling - Securely manages API keys, client sends "dummy", relay adds real provider key
Rate limiting foundation - Extensible middleware structure for adding custom rate limits
Structured logging - All requests logged with timestamps, token counts, and truncation events
Model mapping - Any OpenAI model name maps to your configured backend model

Context Intelligence:

Real-time context monitoring - Estimates token count on every request
Smart truncation - Keeps system prompt + last 7 messages when approaching limits
Message size protection - Caps individual messages at 20k tokens
Streaming support - Full SSE streaming with proper error handling

Result: Long coding sessions continue uninterrupted regardless of which model you use.

Result: Long coding sessions continue uninterrupted, with Together AI's Cache-Aware Disaggregation providing 80% cost savings on repeated context via their KV-cache reuse.

Architecture

┌─────────────┐      ┌─────────────────┐      ┌─────────────────┐      ┌─────────────┐
│   OpenCode  │ ───► │   Go Relay      │ ───► │   LLM Provider  │ ───► │   Any LLM   │
│   (Client)  │      │   (This Server) │      │  (Together AI,  │      │  (MiniMax,  │
└─────────────┘      └─────────────────┘      │   OpenAI, etc)  │      │  GPT-4, etc)│
                           │                  └─────────────────┘      └─────────────┘
                           │
                           ├─ 1. Receive OpenAI-format request
                           ├─ 2. Authentication & validation
                           ├─ 3. Estimate token count
                           ├─ 4. Truncate if >180k tokens
                           ├─ 5. Cap individual messages >20k
                           ├─ 6. Structured logging
                           ├─ 7. Map model name → Configured backend
                           └─ 8. Forward to LLM provider

Middleware Layer Capabilities

The relay acts as a model-agnostic secure middleware layer providing:

Feature	Description	Benefit
Authentication	Client uses dummy key, relay injects real provider key	Secure credential management
Request Logging	Structured logs: method, path, tokens, truncations	Observability & debugging
Rate Limiting	Extensible middleware (add your own limits)	Abuse prevention
Context Guardrails	Enforces 180k total, 20k per-message limits	Prevents context overflow errors
Model Normalization	Maps any model name to configured backend	OpenAI-compatible drop-in
Streaming Proxy	Full SSE support with error handling	Real-time responses
Health Monitoring	`/health` endpoint for load balancers	Production readiness

Context Management Flow

Request with 200 messages (~250k tokens)
              │
              ▼
    ┌──────────────────┐
    │  Per-message     │  Messages >20k tokens truncated
    │  Truncation      │  "[...content truncated]"
    └──────────────────┘
              │
              ▼
    ┌──────────────────┐
    │  Context Check   │  Total >180k tokens?
    │                  │  Yes → Keep system + last 7
    └──────────────────┘  No  → Pass through
              │
              ▼
    ┌──────────────────┐
    │  Together AI     │  CPD automatically caches
    │  CPD System      │  repeated context prefixes
    └──────────────────┘
              │
              ▼
    Streaming response back to client

Quick Start

1. Get Together AI API Key

Sign up at together.ai and create an API key.

2. Configure Environment

cp .env.example .env
# Edit .env:
# LLM_PROVIDER_URL=https://api.together.xyz
# LLM_PROVIDER_KEY=your_together_api_key_here

3. Run with Docker

docker run -d \
  -p 8080:8080 \
  -e LLM_PROVIDER_URL=https://api.together.xyz \
  -e LLM_PROVIDER_KEY=your_key \
  laithaj/gorelayserve:latest

Or use docker-compose:

docker compose up -d

4. Configure OpenCode

Add to ~/.opencode/opencode.json:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "go-relay": {
      "name": "Go Relay (Together AI)",
      "npm": "@ai-sdk/openai-compatible",
      "options": {
        "baseURL": "http://localhost:8080/v1",
        "apiKey": "dummy"
      },
      "models": {
        "gpt-4": {
          "name": "MiniMax M2.5",
          "options": { "model": "gpt-4" },
          "limit": { "context": 196608, "output": 8192 }
        }
      }
    }
  },
  "model": "go-relay/gpt-4"
}

5. Use

opencode

Or test with curl:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4","messages":[{"role":"user","content":"Hello"}]}'

Configuration

Environment Variables

Variable	Required	Default	Description
`LLM_PROVIDER_URL`	Yes	-	Together AI endpoint (`https://api.together.xyz`)
`LLM_PROVIDER_KEY`	Yes	-	Together AI API key
`RELAY_PORT`	No	`8080`	Server port

Context Management Limits

All limits are configurable in internal/proxy/proxy.go. Default values work for 192k context models:

Limit	Default	Behavior	Customize For...
Total context	180k tokens	Truncate to system + last 7	100k/128k context models
Per-message	20k tokens	Truncate with notice	Specific use cases
Messages kept	8 total	System + last 7	More/less history

Adjust limits in code (internal/proxy/proxy.go):

if totalTokens > 180000 {        // Change threshold
    newMessages = messages[len(messages)-7:]  // Change 7 to keep more/less
}

Features

Middleware & Security

✓ Authentication proxy - Secure API key handling (client dummy key → real provider key)
✓ Structured logging - Request/response logging with token counts and truncation events
✓ Rate limiting ready - Extensible middleware for adding custom rate limits
✓ Health checks - HTTP /health endpoint for load balancers and monitoring
✓ Non-root container - Runs as unprivileged user for production security

Context Management

✓ Smart truncation - Keeps system prompt + last 7 messages when approaching 180k limit
✓ Message size protection - Caps individual messages at 20k tokens
✓ Real-time token estimation - Monitors context size on every request
✓ Model mapping - Any OpenAI model name maps to your configured backend

API Compatibility

✓ OpenAI-compatible - Drop-in replacement for OpenAI SDK
✓ Streaming support - Full Server-Sent Events (SSE) for real-time responses
✓ Error handling - Proper HTTP status codes and error messages
✓ Cost optimized - Leverages Together AI's CPD for 80% savings on cached input

API Endpoints

Endpoint	Method	Description
`/v1/chat/completions`	POST	Chat completions (OpenAI-compatible)
`/health`	GET	Health check (returns `ok`)

Using Different Models

This relay is model-agnostic. To use a different model, edit internal/proxy/proxy.go:

// Change this line to any model supported by your provider
var defaultModel = "meta-llama/Llama-3.3-70B-Instruct-Turbo"  // Example: Llama
// var defaultModel = "Qwen/Qwen2.5-72B-Instruct-Turbo"        // Example: Qwen
// var defaultModel = "gpt-4o"                                  // Example: OpenAI (via compatible endpoint)

Then rebuild and push:

docker build -t your-image:latest .
docker push your-image:latest

Default: MiniMax M2.5

The relay ships with MiniMax M2.5 as the default because:

192k context - Handles large codebases
456B MoE architecture - Efficient inference
Together AI CPD - 80% cost savings on cached context ($0.06/M vs $0.30/M)
No cold starts - Unlike 600B+ parameter models

MiniMax M2.5 Specs:

Parameters: 456B MoE
Context: 192,000 tokens
Output: Up to 8,192 tokens
Pricing: $0.30/M input (new), $0.06/M input (cached), $1.20/M output
Best for: Coding, reasoning, long-context tasks

Troubleshooting

"Context too long" errors

The relay should prevent this. Check logs:

docker logs go_relay | grep TRUNCATE

Model returns garbled tool calls

Some models (like MiniMax M2.5) have limited tool support. For file operations, ask the model to output content directly:

"Output the markdown directly without using tools. I'll copy it myself."

Check relay health

curl http://localhost:8080/health

Development

# Clone
git clone https://github.com/Layyyth/GORelayServe.git
cd GORelayServe

# Build
go build -o relay ./cmd/relay/main.go

# Run
./relay

License

MIT License - See LICENSE file.

Author

Created by Laith AbuJaafar

Compatible with any OpenAI-compatible LLM provider. Ships with Together AI + MiniMax M2.5 as default configuration.

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
cmd/relay		cmd/relay
internal/proxy		internal/proxy
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
AGENTS.md		AGENTS.md
Dockerfile		Dockerfile
README.md		README.md
docker-compose.yml		docker-compose.yml
go.mod		go.mod
go.sum		go.sum

Folders and files

Latest commit

History

Repository files navigation

LLM Relay Server

Why This Relay?

The Problem

The Solution

Architecture

Middleware Layer Capabilities

Context Management Flow

Quick Start

1. Get Together AI API Key

2. Configure Environment

3. Run with Docker

4. Configure OpenCode

5. Use

Configuration

Environment Variables

Context Management Limits

Features

Middleware & Security

Context Management

API Compatibility

API Endpoints

Using Different Models

Default: MiniMax M2.5

Troubleshooting

"Context too long" errors

Model returns garbled tool calls

Check relay health

Development

License

Author

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages