Streaming completions
Stream tokens as they're generated. Lower perceived latency, smoother UX. Server-Sent Events on the wire.
Enable streaming
Set "stream": true in any chat-completions call.
Python pattern
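# Assumes `client` is an already-configured SDK client.
stream = client.chat.completions.create(
    model="microsoft/phi-4",
    messages=[{"role": "user", "content": "Count to 10."}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental piece of the reply; content may be None.
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)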
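If you also need the full reply once streaming finishes, the same loop can accumulate the deltas. A minimal sketch: `collect_text` and `fake_chunk` are hypothetical helpers, not part of any SDK, and the fake chunks only mimic the `choices[0].delta.content` shape used above so the example runs without a network call.

```python
from types import SimpleNamespace


def collect_text(stream):
    """Join the content deltas of a chat-completions stream into one string."""
    parts = []
    for chunk in stream:
        # Defensive: skip chunks with no choices or with an empty delta.
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    """Hypothetical stand-in that mimics the chunk shape used above."""
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
    )


demo = [fake_chunk("1, "), fake_chunk(None), fake_chunk("2, "), fake_chunk("3")]
print(collect_text(demo))  # 1, 2, 3
```

With a real stream, call `collect_text(stream)` in place of the print loop, or append to the list inside that loop to get both live output and the final text.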
Node pattern
const stream = await client.chat.completions.create({
  model: "microsoft/phi-4",
  messages: [{ role: "user", content: "Count to 10." }],
  stream: true,
});
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}
Raw SSE
Each chunk is a JSON object on a data: line. The stream ends with data: [DONE]. A minimal raw consumer:
curl -N https://api.tomoul.ai/v1/chat/completions \
  -H "Authorization: Bearer $TOMOUL_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/phi-4",
    "messages": [{"role":"user","content":"Count to 10."}],
    "stream": true
  }'
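If you consume the raw stream from code rather than curl, the framing described above can be parsed in a few lines. This is a sketch, not a reference client: `iter_sse_json` is a hypothetical helper, and the sample payloads assume the JSON chunks mirror the `choices[0].delta.content` shape the SDK examples read.

```python
import json


def iter_sse_json(lines):
    """Yield the decoded JSON payload of each `data:` line until `[DONE]`."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end-of-stream sentinel
        yield json.loads(payload)


# Sample lines standing in for a real response body (assumed payload shape).
sample = [
    'data: {"choices":[{"delta":{"content":"1"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":", 2"}}]}',
    "",
    "data: [DONE]",
]
text = "".join(
    event["choices"][0]["delta"].get("content") or ""
    for event in iter_sse_json(sample)
)
print(text)  # 1, 2
```

Any iterable of text or byte lines works as input, so the same function can sit on top of whatever HTTP client you use to read the response line by line.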
Cancelling
- Python SDK: stream.close().
- Node SDK: pass { signal: controller.signal } and call controller.abort().
- Raw HTTP: close the TCP connection.
Closing the stream halts generation server-side and stops the meter mid-token — you're not billed for the unsent tail.
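A common reason to cancel is that you already have enough output. The sketch below stops after a character budget and then calls close(), as in the Python SDK option above. `read_until` and `FakeStream` are hypothetical names; the fake only imitates an iterable stream with a close() method so the example runs offline.

```python
from types import SimpleNamespace


def read_until(stream, max_chars):
    """Read content deltas until max_chars is reached, then close the stream."""
    parts, total = [], 0
    try:
        for chunk in stream:
            delta = chunk.choices[0].delta.content if chunk.choices else None
            if not delta:
                continue
            parts.append(delta)
            total += len(delta)
            if total >= max_chars:
                break  # enough output; stop reading
    finally:
        stream.close()  # always release the connection, even on errors
    return "".join(parts)


class FakeStream:
    """Hypothetical stand-in for an SDK stream: iterable, with close()."""

    def __init__(self, texts):
        self._it = iter(texts)
        self.closed = False

    def __iter__(self):
        return self

    def __next__(self):
        if self.closed:
            raise StopIteration
        text = next(self._it)
        return SimpleNamespace(
            choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
        )

    def close(self):
        self.closed = True


demo = FakeStream(["1, ", "2, ", "3, ", "4, "])
print(repr(read_until(demo, max_chars=5)))  # '1, 2, '
print(demo.closed)  # True
```

Putting close() in a finally block means the stream is also closed if the loop raises, so a failed consumer does not leave generation running.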
Why prefer streaming?
Perceived latency is dominated by time-to-first-token, not by throughput in tokens per second. Streaming cuts perceived latency from "wait 4 seconds for an answer" to "see the answer start in 200ms." Cheap UX win.
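To see the gap between first-token time and total time for yourself, you can time any stream as you iterate it. A rough sketch: `measure_stream` and `slow_tokens` are hypothetical helpers, the first chunk is used as a proxy for the first token, and the simulated delay is arbitrary rather than a measured figure.

```python
import time


def measure_stream(chunks):
    """Return (seconds to first chunk, total seconds, chunk count)."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in chunks:
        if first is None:
            first = time.perf_counter() - start  # time to first chunk
        count += 1
    total = time.perf_counter() - start
    return first, total, count


def slow_tokens(n, delay):
    """Hypothetical stand-in for a model stream: n tokens, `delay` seconds apart."""
    for i in range(n):
        time.sleep(delay)
        yield str(i)


ttft, total, count = measure_stream(slow_tokens(5, 0.01))
print(f"first token after {ttft:.3f}s, all {count} tokens after {total:.3f}s")
```

Passing a real stream object instead of `slow_tokens(...)` gives the same two numbers for an actual request; without streaming, the user waits for the second number before seeing anything.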
Last updated 13 May 2026