Functional
- Accept `/chat/completions`-style requests and forward them to the upstream.
- Emit SSE in strict order: prompt summary → reasoning summary → final output.
- Support multiple reasoning-capable models via `model` (with an optional `summary_model`).
- Provide a client script that consumes the SSE stream.
- Handle missing/malformed reasoning boundaries per documented assumptions.
Non-functional
- Low TTFT (time to first token): the prompt summary should arrive quickly.
- Failure resilience (timeouts, retries for summary calls, graceful error events).
- Developer experience (clear event schema, predictable API surface).
- End-user experience (readable summaries, stable streaming behavior).
- Maintainability (clear assumptions, capability registry).
- GatewayRequest: inbound request; fields include `model`, `messages`, `stream`, `summary_model`, `temperature`, `max_tokens` (see the sketch after this list).
- UpstreamRequest: request forwarded to `/chat/completions`, with injected system instructions to enforce `<analysis>`/`<final>` tags.
- UpstreamStreamChunk: streamed delta data from upstream.
- SummaryTask: non-streaming request for prompt or reasoning summaries.
- GatewayEvent: SSE events emitted by the gateway (`summary.prompt`, `summary.reasoning`, `output.delta`, `output.done`, `error`).
- ModelCapability: per-model metadata (tag format, reasoning support, parsing strategy).
- RequestContext: in-flight state (buffers, timers, request IDs, error state).
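As a rough illustration, the two models the client touches most directly might look like the Pydantic sketch below. Field names follow the list above; types and defaults are assumptions, not the repository's actual definitions.

```python
# Hypothetical sketch of the two client-facing models. Field names follow the
# list above; types and defaults are assumptions, not the repo's definitions.
from typing import Literal, Optional
from pydantic import BaseModel

class Message(BaseModel):
    role: Literal["system", "user", "assistant"]
    content: str

class GatewayRequest(BaseModel):
    model: str
    messages: list[Message]
    stream: bool = True
    summary_model: Optional[str] = None  # optional override for summary calls
    temperature: Optional[float] = None
    max_tokens: Optional[int] = None

class GatewayEvent(BaseModel):
    event: Literal["summary.prompt", "summary.reasoning",
                   "output.delta", "output.done", "error"]
    text: Optional[str] = None  # omitted for output.done
    request_id: str
```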
- Endpoint: `POST /v1/chat/completions` (OpenAI-compatible shape with optional extensions)
- Gateway health: `GET /healthz` (process liveness)
- Upstream health: `GET /upstream-health` (checks upstream reachability)
Request body (example):
{
"model": "reasoning-llm",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain why the sky is blue."}
],
"stream": true,
"summary_model": "fast-llm",
"temperature": 0.2
}

Response (SSE events):
event: summary.prompt
data: {"text":"...","request_id":"..."}
event: summary.reasoning
data: {"text":"...","request_id":"..."}
event: output.delta
data: {"text":"...","request_id":"..."}
event: output.done
data: {"request_id":"..."}
- FastAPI gateway with SSE streaming and upstream forwarding.
- Tag parser + buffering to preserve required ordering.
- Mock upstream server for local testing and env-based config for real endpoints.
- Client script that consumes SSE events and prints sections in order.
- Tests covering ordered output, config behavior, and missing-tag fallback.
Client
|
| POST /v1/chat/completions (stream=true)
v
Gateway
|-- Call A: prompt summary (fast, non-streaming)
|-- Call B: upstream stream (tagged analysis/final)
|-- Call C: reasoning summary (fast, non-streaming)
|
+--> SSE: summary.prompt -> summary.reasoning -> output.delta -> output.done
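The ordering guarantee follows from how the three calls are sequenced: Call A starts immediately, Call B is buffered into its tagged sections, and Call C runs once the analysis text is complete. A simplified asyncio sketch (with the HTTP calls stubbed out; names are illustrative, not the gateway's real internals) might look like this:

```python
# Simplified orchestration sketch. Here the final section is fully buffered
# for clarity; the real gateway can flush final deltas live once the
# reasoning summary is emitted.
import asyncio
from typing import AsyncIterator

async def call_prompt_summary() -> str:  # Call A: fast, non-streaming
    return "User asks why the sky is blue."

async def call_upstream_stream() -> list[tuple[str, str]]:  # Call B, pre-parsed
    return [("analysis", "Consider Rayleigh scattering..."),
            ("final", "The sky looks blue because...")]

async def call_reasoning_summary(text: str) -> str:  # Call C: fast, non-streaming
    return f"Model reasoned about: {text[:30]}..."

async def events() -> AsyncIterator[tuple[str, str]]:
    # Start Call A immediately so the prompt summary keeps TTFT low.
    prompt_task = asyncio.create_task(call_prompt_summary())
    reasoning, final = [], []
    for section, delta in await call_upstream_stream():  # buffer tagged sections
        (reasoning if section == "analysis" else final).append(delta)
    yield "summary.prompt", await prompt_task
    yield "summary.reasoning", await call_reasoning_summary("".join(reasoning))
    for delta in final:
        yield "output.delta", delta
    yield "output.done", ""

async def main() -> None:
    async for name, text in events():
        print(name, text)

asyncio.run(main())
```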
Clone and enter the repo first:
git clone https://github.com/syranol/inference-gateway.git
cd inference-gateway

There are three ways to test this repository:
I) Mock upstream (local, no credentials required).
II) Friendli Dedicated endpoint (live API, requires endpoint ID + token).
III) Friendli Serverless quick test (live API, preconfigured model).
Setup
python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Steps to run

The local demo uses three processes because the client talks to the gateway, and the gateway forwards to the upstream mock. All three must be running to see the expected output.
Terminal 1:
source .venv/bin/activate
make mock-upstream

Terminal 2:

source .venv/bin/activate
make run-gateway

Terminal 3:

source .venv/bin/activate
make run-client

Expected result (mock)
=== 1) Summary of the prompt ===
...
=== 2) Summary of the model's reasoning ===
...
=== 3) The model's final output ===
...
[done]
1. Setup
python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Note: you must provide the following in the next two steps:

- `UPSTREAM_BASE_URL` (e.g., `https://api.friendli.ai/dedicated/v1`)
- `UPSTREAM_API_KEY` (your Friendli token)
- `UPSTREAM_PATH` defaults to `/chat/completions` unless you override it.
2. Run gateway (terminal 1)
source .venv/bin/activate
export UPSTREAM_BASE_URL="https://api.friendli.ai/dedicated/v1"
export UPSTREAM_PATH="/chat/completions"
export UPSTREAM_API_KEY="YOUR_FRIENDLI_TOKEN"
make run-gateway

Expected result (gateway terminal)
uvicorn app.main:app --port 8000
INFO: Started server process [PID]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
INFO: 127.0.0.1:PORT - "POST /v1/chat/completions HTTP/1.1" 200 OK
3. Run client (terminal 2)
source .venv/bin/activate
export UPSTREAM_API_KEY="YOUR_FRIENDLI_TOKEN"
export FRIENDLI_ENDPOINT_ID="YOUR_ENDPOINT_ID"
make run-client CLIENT_ARGS="--wake"

Note: `--model` is optional here because the client will use `FRIENDLI_ENDPOINT_ID` when set.
Ensure `FRIENDLI_ENDPOINT_ID` is exported in the same terminal where you run `make run-client`.
If the endpoint is offline, `--wake` will send a wake request. You should see:
python3.11 client.py --wake
[info] using model: YOUR_ENDPOINT_ID
[info] checking dedicated endpoint status: YOUR_ENDPOINT_ID
[info] dedicated endpoint not ready; sending wake request
Expected result (client terminal)
python3.11 client.py --wake
[info] using model: YOUR_ENDPOINT_ID
[info] checking dedicated endpoint status: YOUR_ENDPOINT_ID
[info] dedicated endpoint is already RUNNING
=== 1) Summary of the prompt ===
...
=== 2) Summary of the model's reasoning ===
...
=== 3) The model's final output ===
...
[done]
Tip: check Dedicated endpoint readiness first:
export FRIENDLI_ENDPOINT_ID="YOUR_ENDPOINT_ID"
make check-dedicated

If `--wake` fails, verify the Dedicated endpoint is not terminated (terminated endpoints must be redeployed).
Expected result (dedicated)
- Same ordered SSE sections as the mock run.
- Content varies based on model and prompt.
1. Setup
python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Note: you must provide the following in the next two steps:

- `UPSTREAM_BASE_URL` (e.g., `https://api.friendli.ai/serverless/v1`)
- `UPSTREAM_API_KEY` (your Friendli token)
- `UPSTREAM_PATH` defaults to `/chat/completions` unless you override it.
2. Run gateway (terminal 1)
source .venv/bin/activate
export UPSTREAM_BASE_URL="https://api.friendli.ai/serverless/v1"
export UPSTREAM_PATH="/chat/completions"
export UPSTREAM_API_KEY="YOUR_FRIENDLI_TOKEN"
make run-gateway

Expected result (gateway terminal)
uvicorn app.main:app --port 8000
INFO: Started server process [PID]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
INFO: 127.0.0.1:PORT - "POST /v1/chat/completions HTTP/1.1" 200 OK
3. Run client (terminal 2)
source .venv/bin/activate
python3.11 client.py --url http://localhost:8000/v1/chat/completions --model meta-llama-3.1-8b-instruct

Note: if the recommended model is unavailable, pick another serverless model from the Friendli Suite.
Alternative (use your own serverless model):
python3.11 client.py --url http://localhost:8000/v1/chat/completions --model YOUR_SERVERLESS_MODEL

Expected result (serverless)
- Same ordered SSE sections as the mock run.
- Content varies based on model and prompt.
Expected result (client terminal)
=== 1) Summary of the prompt ===
FriendliAI is not specified in my knowledge database. However, I can tell you that based on your prompt, I will need more context about FriendliAI to give a thoughtful and accurate answer.
=== 2) Summary of the model's reasoning ===
Here are the key steps and features of FriendliAI in 3 bullet points:
• FriendliAI uses machine learning algorithms and NLP techniques to generate human-like responses, understanding context, sarcasm, and other nuances of human language.
• It can learn and adapt to user preferences, interests, and communication style over time, allowing for personalized interactions and responses.
• FriendliAI also has the capacity for empathy and emotional understanding, detecting and acknowledging emotions to build trust and rapport with users.
=== 3) The model's final output ===
There is no specific numeric answer to this question as it discusses a technology-based topic.
[done]
make test

Test coverage includes:
- Gateway ordering and missing-tag fallback behavior.
- Configuration defaults/overrides and model allowlist enforcement.
- Summary model selection and reasoning truncation behavior.
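As a flavor of the ordering check, a test could assert the event sequence directly. `collect_events` below is a hypothetical helper, not a fixture the repo necessarily provides:

```python
# Hypothetical ordering test; collect_events is an assumed helper that returns
# the SSE event names emitted for one request, in order.
def test_event_ordering(collect_events):
    names = collect_events({"model": "reasoning-llm", "stream": True,
                            "messages": [{"role": "user", "content": "hi"}]})
    assert names[0] == "summary.prompt"
    assert names[1] == "summary.reasoning"
    assert names[-1] == "output.done"
    assert all(n == "output.delta" for n in names[2:-1])
```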
Set env vars in your shell before starting the gateway, for example:
export UPSTREAM_BASE_URL="https://api.friendli.ai/serverless/v1"
export UPSTREAM_PATH="/chat/completions"
export UPSTREAM_API_KEY="YOUR_FRIENDLI_TOKEN"

- `UPSTREAM_BASE_URL` (default: `http://localhost:8001`)
- `UPSTREAM_PATH` (default: `/chat/completions`)
- `UPSTREAM_API_KEY` (optional)
- `SUMMARY_MODEL_DEFAULT` (optional)
- `ALLOW_MODELS` (comma-separated allowlist)
- `REQUEST_TIMEOUT` (seconds, default: 60)
- `SUMMARY_TIMEOUT` (seconds, default: 10)
- `MAX_REASONING_CHARS` (default: 8000)
- `UPSTREAM_MAX_RETRIES` (default: 3)
- `UPSTREAM_RETRY_BACKOFF` (seconds, default: 1.0)
- `ENABLE_PARSE_REASONING` (default: true; use upstream reasoning fields if present)
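For reference, a loader that mirrors these names and defaults could look like the following dataclass; the shape itself is an assumption, only the variable names and defaults come from the list above:

```python
# Illustrative settings loader; variable names and defaults mirror the list
# above, but the dataclass shape itself is an assumption, not the repo's code.
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    upstream_base_url: str = os.getenv("UPSTREAM_BASE_URL", "http://localhost:8001")
    upstream_path: str = os.getenv("UPSTREAM_PATH", "/chat/completions")
    upstream_api_key: str | None = os.getenv("UPSTREAM_API_KEY")
    summary_model_default: str | None = os.getenv("SUMMARY_MODEL_DEFAULT")
    allow_models: tuple[str, ...] = tuple(
        m.strip() for m in os.getenv("ALLOW_MODELS", "").split(",") if m.strip())
    request_timeout: float = float(os.getenv("REQUEST_TIMEOUT", "60"))
    summary_timeout: float = float(os.getenv("SUMMARY_TIMEOUT", "10"))
    max_reasoning_chars: int = int(os.getenv("MAX_REASONING_CHARS", "8000"))
    upstream_max_retries: int = int(os.getenv("UPSTREAM_MAX_RETRIES", "3"))
    upstream_retry_backoff: float = float(os.getenv("UPSTREAM_RETRY_BACKOFF", "1.0"))
    enable_parse_reasoning: bool = os.getenv("ENABLE_PARSE_REASONING", "true").lower() == "true"
```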
- `make mock-upstream` — run local upstream mock
- `make run-gateway` — run gateway server
- `make run-gateway-dev` — run gateway with auto-reload
- `make run-client` — run client against gateway
- `make test` — run tests
- `make health-gateway` — check gateway liveness (`/healthz`)
- `make health-upstream` — check upstream reachability (`/upstream-health`)
- `make check-dedicated` — check Friendli Dedicated endpoint status
This work was developed with the help of an AI coding tool as a research, brainstorming, and review aid (not as an autonomous code generator). Specifically, I used AI to:
- Brainstorm design options and weigh pros/cons (e.g., tag parsing vs heuristics, single-call vs dual-call pipelines, buffering strategies).
- Identify and close knowledge gaps (e.g., streaming SSE framing, TTFT tradeoffs, and reasoning separation patterns).
- Stress-test assumptions and failure modes before implementation.
- Assist with implementation planning and code scaffolding (gateway server, client script, parser/state machine, and test harness) while I review and integrate all changes.
All architectural decisions and the final written design were made, reviewed, and refined by me.
- Enforce `<analysis>...</analysis>` and `<final>...</final>` boundaries via a system prompt.
- If upstream provides explicit reasoning fields (e.g., `reasoning_content`), use them when `ENABLE_PARSE_REASONING=true`; otherwise fall back to tag parsing (a parser sketch follows).
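When tag parsing is needed, it can be a small incremental state machine over the streamed deltas. The sketch below is a simplified version (class and method names are illustrative, not the gateway's internals); it holds back a short tail so tags split across chunk boundaries are still recognized:

```python
# Simplified incremental parser for <analysis>/<final> tagged streams.
# Illustrative sketch, not the gateway's actual parser.
class TagParser:
    TAGS = {"<analysis>": "analysis", "</analysis>": None,
            "<final>": "final", "</final>": None}
    HOLDBACK = max(len(t) for t in TAGS) - 1  # longest possible partial tag

    def __init__(self):
        self.section = None  # None, "analysis", or "final"
        self.buf = ""

    def feed(self, delta):
        """Yield (section, text) pairs for text that is safely past any tag."""
        self.buf += delta
        while True:
            hits = [(i, t) for t in self.TAGS if (i := self.buf.find(t)) != -1]
            if not hits:
                break
            i, tag = min(hits)
            if i > 0 and self.section:
                yield self.section, self.buf[:i]
            self.buf = self.buf[i + len(tag):]
            self.section = self.TAGS[tag]
        # Keep a tail that could be the start of a split tag; emit the rest.
        if self.section and len(self.buf) > self.HOLDBACK:
            cut = len(self.buf) - self.HOLDBACK
            yield self.section, self.buf[:cut]
            self.buf = self.buf[cut:]

parser = TagParser()
for chunk in ["<anal", "ysis>thinking...</analysis><final>Answer.", "</final>"]:
    for section, text in parser.feed(chunk):
        print(section, repr(text))  # analysis 'thinking...' then final 'Answer.'
```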
- Missing `<analysis>` → empty reasoning summary; treat the stream as final output.
- Missing `<final>` → treat the remaining stream as reasoning; the final output may be empty.
- Timeouts and retries with backoff for summary calls (prompt + reasoning) and the initial upstream stream connection (retryable 5xx, e.g., 503 during warm-up).
- If the prompt or reasoning summary fails, continue streaming the final output and emit an `error` event.
- If a summary fails or times out, emit a short fallback summary derived from the first sentences of the corresponding text buffer (see the sketch after this list).
- If upstream streaming fails, emit an `error` event with partial state and close cleanly.
- On client disconnect, cancel the upstream stream immediately.
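The fallback summary can be as simple as trimming the buffered text to its first sentences; a minimal sketch (the exact sentence-splitting regex is an assumption):

```python
# Illustrative fallback: derive a stand-in summary from the first sentences of
# whatever text was buffered. The sentence-splitting regex is an assumption.
import re

def fallback_summary(buffer: str, max_sentences: int = 2, max_chars: int = 280) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", buffer.strip())
    return " ".join(sentences[:max_sentences])[:max_chars]
```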
- Retries apply to upstream 5xx responses (502/503/504) and request errors.
- Defaults: `UPSTREAM_MAX_RETRIES=3`, `UPSTREAM_RETRY_BACKOFF=1.0` (exponential backoff; see the sketch below).
- This is especially useful for Dedicated endpoints warming up (503).
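In sketch form, the retry loop around an upstream request might look like the following; the retry conditions and defaults mirror the notes above, while the `httpx` usage is illustrative:

```python
# Illustrative retry loop with exponential backoff; the retry conditions
# (502/503/504 and transport errors) and defaults follow the notes above.
import asyncio
import httpx

RETRYABLE = {502, 503, 504}

async def post_with_retries(client: httpx.AsyncClient, url: str, payload: dict,
                            max_retries: int = 3, backoff: float = 1.0):
    last = None
    for attempt in range(max_retries + 1):
        try:
            last = await client.post(url, json=payload)
            if last.status_code not in RETRYABLE:
                return last
        except httpx.RequestError:
            if attempt == max_retries:
                raise
        if attempt < max_retries:
            await asyncio.sleep(backoff * (2 ** attempt))  # 1s, 2s, 4s, ...
    return last  # last retryable response after exhausting retries
```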
- Pass through `model` from the client request.
- Optional `summary_model` override for prompt/reasoning summaries.
- Allowlist controls via `ALLOW_MODELS`.
- Inline heuristics: requires no prompt changes, but is unreliable across models.
- Schema-first prompting: deterministic parsing, but can drift and break streaming.
- Single-call only: lower cost, but harder to satisfy strict ordering and low TTFT.