Streaming completions

Stream tokens as they're generated for lower perceived latency and a smoother UX. Responses are delivered as Server-Sent Events (SSE).

Enable streaming

Set "stream": true in any chat-completions call.

Python pattern

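# Assumes `client` is an already-configured SDK client.
stream = client.chat.completions.create(
    model="microsoft/phi-4",
    messages=[{"role": "user", "content": "Count to 10."}],
    stream=True,
)
for chunk in stream:
    # Guard against chunks with no choices, mirroring the Node pattern's `?.` checks.
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    # Some chunks carry no text, so print only non-empty deltas.
    if delta:
        print(delta, end="", flush=True)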

Node pattern

const stream = await client.chat.completions.create({
  model: "microsoft/phi-4",
  messages: [{ role: "user", content: "Count to 10." }],
  stream: true,
});
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}

Raw SSE

Each chunk is a JSON object on a data: line, and the stream ends with data: [DONE]. A minimal raw consumer (curl's -N flag disables output buffering, so chunks print as they arrive):

curl -N https://api.tomoul.ai/v1/chat/completions \
  -H "Authorization: Bearer $TOMOUL_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/phi-4",
    "messages": [{"role":"user","content":"Count to 10."}],
    "stream": true
  }'
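
If you consume the raw stream without an SDK, the framing described above can be parsed by hand. The following is a minimal sketch in Python using only the standard library. It assumes one JSON chunk per data: line and the same choices[0].delta.content shape used in the SDK examples; the function name iter_sse_deltas and the sample lines are hypothetical illustrations, not captured API output.

```python
import json
from typing import Iterable, Iterator


def iter_sse_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from an iterable of decoded SSE lines.

    Assumption: one JSON chunk per `data:` line, shaped like the SDK
    examples (choices[0].delta.content). Adjust if your payloads differ.
    """
    for raw in lines:
        line = raw.strip()
        # Skip blank separators, comment lines, and non-data fields.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        delta = (choices[0].get("delta") or {}).get("content")
        if delta:
            yield delta


# Illustrative input only: lines as they might appear on the wire.
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"1, "}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"2"}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_sse_deltas(sample)))
```

Because the function accepts any iterable of strings, it can be fed decoded lines from whichever HTTP client you use.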

Cancelling

  • Python SDK: stream.close().
  • Node SDK: pass { signal: controller.signal } and call controller.abort().
  • Raw HTTP: close the TCP connection.

Closing the stream halts generation server-side and stops the meter immediately, so you're not billed for the unsent tail.
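
One common reason to cancel is a client-side output budget. The sketch below shows one possible way to wrap the Python pattern so the stream is closed whether the loop finishes or stops early. The names read_with_budget and FakeStream are hypothetical; FakeStream stands in for the SDK stream object so the example runs without network access, and the only SDK behaviour assumed is the stream.close() method listed above.

```python
from types import SimpleNamespace


def read_with_budget(stream, max_chars: int) -> str:
    """Collect streamed text, cancelling once max_chars have been received."""
    parts, total = [], 0
    try:
        for chunk in stream:
            if not chunk.choices:
                continue
            delta = chunk.choices[0].delta.content
            if not delta:
                continue
            parts.append(delta)
            total += len(delta)
            if total >= max_chars:
                break  # stop reading; close() below cancels the request
    finally:
        stream.close()  # also runs on normal completion and on exceptions
    return "".join(parts)[:max_chars]


class FakeStream:
    """Hypothetical stand-in for an SDK stream: iterable chunks plus close()."""

    def __init__(self, texts):
        self._texts = iter(texts)
        self.closed = False

    def __iter__(self):
        return self

    def __next__(self):
        if self.closed:
            raise StopIteration
        text = next(self._texts)
        delta = SimpleNamespace(content=text)
        return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

    def close(self):
        self.closed = True


fake = FakeStream(["1, ", "2, ", "3, ", "4, ", "5"])
print(read_with_budget(fake, max_chars=6))
```

Putting close() in a finally block means an exception in your own processing code also cancels the stream rather than leaving it open.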

Why prefer streaming?

Perceived latency is dominated by time-to-first-token, not by overall throughput in tokens per second. Streaming can change the experience from "wait 4 seconds for an answer" to "see the answer start in 200 ms." Cheap UX win.
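
To check this against your own workload, you can time the first delta separately from the full response. The sketch below is a hypothetical helper (measure_stream is not part of any SDK) that works on any iterable of text deltas; the simulated generator and its delays are made-up values for illustration only.

```python
import time
from typing import Iterable, Tuple


def measure_stream(deltas: Iterable[str]) -> Tuple[str, float, float]:
    """Return (text, seconds_to_first_delta, total_seconds)."""
    start = time.perf_counter()
    first = None
    parts = []
    for delta in deltas:
        if first is None:
            first = time.perf_counter() - start  # time-to-first-token
        parts.append(delta)
    total = time.perf_counter() - start
    # If nothing arrived, report the total as the time to first delta.
    return "".join(parts), (first if first is not None else total), total


def simulated_stream():
    """Stand-in for a model stream, with short artificial delays."""
    for piece in ["1, ", "2, ", "3"]:
        time.sleep(0.01)
        yield piece


text, ttft, total = measure_stream(simulated_stream())
print(f"first delta after {ttft:.3f}s, full response after {total:.3f}s")
```

With a real stream, pass a generator that yields each non-empty delta from the Python pattern in place of simulated_stream().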

Last updated 13 May 2026