POST https://gateway.llm-stats.com/v1/chat/completions
The chat endpoint is drop-in compatible with the OpenAI SDK. Point any OpenAI client at our base URL and your existing code keeps working — we route the model to the best healthy provider behind the scenes.

Quickstart

from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.llm-stats.com/v1",  # the gateway, not api.openai.com
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="claude-opus-4-6",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    temperature=0.7,
    max_tokens=1024,
)

print(response.choices[0].message.content)

Request body

The request body follows the OpenAI chat completions schema. The most common fields:
model (string, required)
Model ID without any provider prefix (e.g. gpt-4o, claude-opus-4-6, llama-4-405b). The router resolves it to the best provider.

messages (array, required)
Conversation history. Each message has a role (system, user, assistant, or tool) and content (a string, or content parts for multimodal models).

stream (boolean, default false)
When true, the response is a Server-Sent Events stream of incremental deltas, using the same wire format as OpenAI.

temperature (number)
Sampling temperature. Range and behavior depend on the underlying model.

max_tokens (integer)
Maximum number of tokens to generate. Capped per model where the provider enforces a limit.

tools (array)
Tool / function definitions. Forwarded verbatim to providers that support tool calling.

response_format (object)
Set { "type": "json_object" } for JSON mode, or { "type": "json_schema", "json_schema": {...} } for structured outputs where supported (see the sketch below).
Any other OpenAI-supported parameter (top_p, presence_penalty, frequency_penalty, seed, logprobs, stop, user, …) is forwarded to the provider when supported.
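As an example, here is a minimal structured-outputs request written against the quickstart client above. It is a sketch: the weather_report schema is purely illustrative, and whether json_schema is honored depends on the underlying model.

response = client.chat.completions.create(
    model="claude-opus-4-6",
    messages=[{"role": "user", "content": "Summarize today's weather in Berlin."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "weather_report",  # illustrative schema, not part of the gateway API
            "schema": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "summary": {"type": "string"},
                },
                "required": ["city", "summary"],
            },
        },
    },
)

print(response.choices[0].message.content)  # a JSON string matching the schema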

Response

The response is an OpenAI chat.completion object:
{
  "id": "chatcmpl_…",
  "object": "chat.completion",
  "created": 1730000000,
  "model": "claude-opus-4-6",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "Hi! …" },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 27,
    "total_tokens": 39
  }
}
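The SDK exposes the same fields as attributes, so the usage block above can be read directly:

print(response.usage.prompt_tokens)      # 12
print(response.usage.completion_tokens)  # 27
print(response.usage.total_tokens)       # 39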

Streaming

Set stream: true to receive a text/event-stream of chat.completion.chunk events:
stream = client.chat.completions.create(
    model="claude-opus-4-6",
    messages=[{"role": "user", "content": "Stream this."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)
The stream terminates with data: [DONE], exactly like OpenAI. Failover is handled before the first byte goes out — once streaming starts, the connection sticks with the chosen provider.
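If you need the complete message as well as live output, one common pattern is to accumulate the deltas while printing. A sketch, using the same stream as above; the guard skips chunks that carry no content delta:

parts = []
for chunk in stream:
    # Some chunks (e.g. the final one) may carry no content delta.
    if chunk.choices and chunk.choices[0].delta.content:
        delta = chunk.choices[0].delta.content
        parts.append(delta)
        print(delta, end="", flush=True)

full_reply = "".join(parts)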

Multimodal input

Models that accept images, audio, or files use the standard OpenAI content-parts shape:
{
  "role": "user",
  "content": [
    { "type": "text", "text": "What's in this image?" },
    {
      "type": "image_url",
      "image_url": { "url": "https://example.com/cat.png" }
    }
  ]
}
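Through the SDK, that shape drops straight into messages. A sketch, assuming an image-capable model and a placeholder image URL:

response = client.chat.completions.create(
    model="claude-opus-4-6",  # any image-capable model ID works here
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/cat.png"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)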

Errors

Failures use the shared error envelope. Provider-classified failures map to standard HTTP statuses:
Status  Meaning
400     Validation failed (max_tokens too high, malformed messages, …).
401     Missing or invalid API key.
403     Key has no access to that model.
429     Rate limited or quota exceeded; retry after Retry-After seconds.
502     All providers returned errors. Includes the last upstream message.
504     All providers timed out. Safe to retry.
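With the OpenAI SDK these statuses surface as typed exceptions, so a retry loop can honor them directly. A minimal sketch reusing the quickstart client; the three-attempt exponential backoff is an assumption, not a gateway requirement:

import time

import openai

for attempt in range(3):
    try:
        response = client.chat.completions.create(
            model="claude-opus-4-6",
            messages=[{"role": "user", "content": "Hello, how are you?"}],
        )
        break
    except openai.RateLimitError as e:
        # 429: wait for Retry-After if the gateway sent it, else back off.
        wait = float(e.response.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait)
    except openai.APIStatusError as e:
        if e.status_code == 504:
            # All providers timed out; per the table above, safe to retry.
            time.sleep(2 ** attempt)
        else:
            raise  # 400/401/403/502: retrying won't help without changes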