Rate limits
Rate limits are per-key, per-model, per-region. Tomoul prefers to throttle (429) rather than fail (503). Headers tell you exactly where you stand.
How limits are bucketed
Each (API key, model, region) triple is its own bucket. So a busy embedding
workload on bge-m3 in Helsinki can't starve your chat workload on phi-4
in Frankfurt.
Within a bucket, two ceilings apply:
- RPM — requests per minute.
- TPM — tokens per minute (sum of input + output).
Whichever fills first throttles the call.
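As a rough illustration of the bucketing rule, here is a toy model in Python. It is a sketch for intuition only, not Tomoul's implementation: the `Bucket` class, the `admit` helper, the lowercase region strings, and the default ceilings passed in are all assumptions made for the example, and the per-minute reset is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    """Toy per-minute counters for one (API key, model, region) triple."""
    rpm_limit: int
    tpm_limit: int
    requests_used: int = 0
    tokens_used: int = 0

    def try_consume(self, tokens: int) -> bool:
        """Admit the call only if both ceilings still have room."""
        if self.requests_used + 1 > self.rpm_limit:
            return False  # RPM filled first
        if self.tokens_used + tokens > self.tpm_limit:
            return False  # TPM filled first
        self.requests_used += 1
        self.tokens_used += tokens
        return True


# Buckets are keyed by the full triple, so they never share counters.
buckets: dict[tuple[str, str, str], Bucket] = {}


def admit(api_key: str, model: str, region: str, tokens: int,
          rpm_limit: int = 600, tpm_limit: int = 200_000) -> bool:
    """Hypothetical helper: look up (or create) the bucket and try the call."""
    key = (api_key, model, region)
    bucket = buckets.setdefault(key, Bucket(rpm_limit, tpm_limit))
    return bucket.try_consume(tokens)
```

In this model, exhausting the TPM ceiling for bge-m3 in Helsinki leaves the phi-4 bucket in Frankfurt untouched, because the two calls resolve to different dictionary keys.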
Response headers
Every response includes:
X-RateLimit-Limit-Requests: 600
X-RateLimit-Remaining-Requests: 599
X-RateLimit-Reset-Requests: 0.1s
X-RateLimit-Limit-Tokens: 90000
X-RateLimit-Remaining-Tokens: 89972
X-RateLimit-Reset-Tokens: 0.018s
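A client can read these headers into a small structure to decide when to slow down. The sketch below assumes the reset values are always decimal seconds with a trailing `s`, as in the sample above; other duration formats are not handled. `RateLimitStatus` and `parse_rate_limit_headers` are hypothetical names, not part of any SDK.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass
class RateLimitStatus:
    limit_requests: int
    remaining_requests: int
    reset_requests_s: float
    limit_tokens: int
    remaining_tokens: int
    reset_tokens_s: float


def _seconds(value: str) -> float:
    # Assumption: values look like "0.1s" (decimal seconds plus an "s" suffix).
    return float(value.strip().removesuffix("s"))


def parse_rate_limit_headers(headers: Mapping[str, str]) -> RateLimitStatus:
    # HTTP header names are case-insensitive, so normalize before lookup.
    h = {k.lower(): v for k, v in headers.items()}
    return RateLimitStatus(
        limit_requests=int(h["x-ratelimit-limit-requests"]),
        remaining_requests=int(h["x-ratelimit-remaining-requests"]),
        reset_requests_s=_seconds(h["x-ratelimit-reset-requests"]),
        limit_tokens=int(h["x-ratelimit-limit-tokens"]),
        remaining_tokens=int(h["x-ratelimit-remaining-tokens"]),
        reset_tokens_s=_seconds(h["x-ratelimit-reset-tokens"]),
    )
```

With the sample headers, `limit_tokens - remaining_tokens` gives 28 tokens consumed in the current window.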
Handling a 429
On throttle, Tomoul returns 429 Too Many Requests with a Retry-After: <seconds> header. Sleep for that many seconds, then retry. Capacity limits always surface as 429, never 503.
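import time

from openai import RateLimitError

# `client` is an already-configured OpenAI SDK client.
for attempt in range(5):
    try:
        resp = client.chat.completions.create(...)
        break
    except RateLimitError as e:
        # Honor the server's Retry-After hint; fall back to 1 second if absent.
        wait = float(e.response.headers.get("Retry-After", "1"))
        time.sleep(wait)
else:
    # All five attempts were throttled; fail loudly instead of continuing without resp.
    raise RuntimeError("still rate limited after 5 attempts")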
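The official OpenAI SDK already retries 429s with exponential backoff by default, so the loop above is only needed if you disable the SDK's built-in retries (max_retries=0) or want explicit control over the wait. If you call the HTTP API directly, apply the same Retry-After logic yourself.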
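For direct HTTP calls without the SDK, the same Retry-After handling can be sketched independently of any particular HTTP library. In the example below, `send` is a hypothetical callable you supply that performs one request and returns `(status, headers, body)`; the exponential fallback when the header is missing is an assumption of this sketch, not documented Tomoul behavior.

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str], bytes]  # (status, headers, body)


def call_with_retry(send: Callable[[], Response],
                    max_attempts: int = 5,
                    sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call send() until it stops returning 429 or attempts run out."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, headers, body
        if attempt == max_attempts - 1:
            break  # out of attempts; do not sleep again
        # Assumes the mapping uses the exact key "Retry-After".
        retry_after = headers.get("Retry-After")
        # Prefer the server's hint; otherwise back off exponentially (1s, 2s, 4s, ...).
        wait = float(retry_after) if retry_after is not None else float(2 ** attempt)
        sleep(wait)
    raise RuntimeError(f"still throttled after {max_attempts} attempts")
```

Passing `sleep` as a parameter keeps the function easy to test, since a test can record the waits instead of actually sleeping.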
Default ceilings
| Plan | Chat (RPM / TPM) | Embeddings (RPM / TPM) |
|---|---|---|
| Free ($5 trial credit) | 60 / 30k | 120 / 200k |
| Pay-as-you-go | 600 / 200k | 1,200 / 2M |
| Scale (contact) | Custom | Custom |
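To see which ceiling binds for a given workload, divide the TPM ceiling by your average tokens per request and compare the result with the RPM ceiling. The helper below is a back-of-the-envelope sketch; `max_sustained_rpm` is a hypothetical name, and the per-request token counts are made-up inputs.

```python
def max_sustained_rpm(rpm_limit: int, tpm_limit: int, tokens_per_request: int) -> int:
    """Requests per minute a bucket can sustain before either ceiling throttles."""
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be positive")
    # Tokens count input + output, so TPM caps requests at tpm_limit // tokens_per_request.
    return min(rpm_limit, tpm_limit // tokens_per_request)


# Pay-as-you-go chat ceilings from the table: 600 RPM / 200k TPM.
heavy = max_sustained_rpm(600, 200_000, 1_000)  # 200, TPM is the binding ceiling
light = max_sustained_rpm(600, 200_000, 200)    # 600, RPM is the binding ceiling
```

At 1,000 tokens per request, a Pay-as-you-go chat bucket tops out at 200 requests per minute even though the RPM ceiling is 600.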
Raising your limit
Contact us with the model, the region, and your projected sustained TPM. Most reasonable increases are handled within one business day.