Rate limits

Rate limits are per-key, per-model, per-region. Tomoul prefers to throttle (429) rather than fail (503). Headers tell you exactly where you stand.

How limits are bucketed

Each (API key, model, region) triple is its own bucket. So a busy embedding workload on bge-m3 in Helsinki can't starve your chat workload on phi-4 in Frankfurt.

Within a bucket, two ceilings apply:

  • RPM — requests per minute.
  • TPM — tokens per minute (sum of input + output).

Whichever fills first throttles the call.
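To reason about when a bucket will throttle, it can help to mirror the two ceilings on the client. The sketch below is a rough client-side estimate, not Tomoul's server-side algorithm: it assumes a rolling 60-second window, and the `BucketLimiter` class, the `limiters` mapping, and the key and region strings are all hypothetical names chosen for illustration.

```python
import time
from collections import deque


class BucketLimiter:
    """Client-side estimate of one (key, model, region) bucket.

    Assumption: a rolling 60-second window; the server may count differently.
    """

    def __init__(self, rpm, tpm, window=60.0, clock=time.monotonic):
        self.rpm = rpm
        self.tpm = tpm
        self.window = window
        self.clock = clock
        self.events = deque()  # (timestamp, tokens) per admitted request

    def _expire(self, now):
        # Drop requests that have aged out of the window.
        while self.events and now - self.events[0][0] >= self.window:
            self.events.popleft()

    def try_acquire(self, tokens):
        """Return True and record the request if both ceilings allow it."""
        now = self.clock()
        self._expire(now)
        used_tokens = sum(t for _, t in self.events)
        if len(self.events) + 1 > self.rpm:
            return False  # RPM ceiling filled first
        if used_tokens + tokens > self.tpm:
            return False  # TPM ceiling filled first
        self.events.append((now, tokens))
        return True


# One limiter per (API key, model, region) triple, mirroring the bucketing above.
# The key and region identifiers here are placeholders.
limiters = {
    ("key-1", "phi-4", "frankfurt"): BucketLimiter(rpm=600, tpm=200_000),
    ("key-1", "bge-m3", "helsinki"): BucketLimiter(rpm=1_200, tpm=2_000_000),
}
```

Because each triple has its own limiter, exhausting the embeddings bucket leaves the chat bucket untouched. The response headers remain the authoritative source for where a bucket actually stands.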

Response headers

Every response includes:

X-RateLimit-Limit-Requests:      600
X-RateLimit-Remaining-Requests:  599
X-RateLimit-Reset-Requests:      0.1s
X-RateLimit-Limit-Tokens:        90000
X-RateLimit-Remaining-Tokens:    89972
X-RateLimit-Reset-Tokens:        0.018s
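A small parser makes these headers easier to act on. The sketch below assumes the reset values are always a number of seconds with an "s" suffix, as in the sample; `Ceiling` and `parse_rate_limit_headers` are hypothetical helper names, not part of any SDK.

```python
from dataclasses import dataclass


@dataclass
class Ceiling:
    limit: int
    remaining: int
    reset_seconds: float


def _parse_reset(value):
    # Assumption: reset values are seconds with an "s" suffix,
    # as in the sample ("0.1s", "0.018s").
    return float(value.strip().rstrip("s"))


def parse_rate_limit_headers(headers):
    """Build a Ceiling for requests and for tokens from a header mapping."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    result = {}
    for kind in ("requests", "tokens"):
        result[kind] = Ceiling(
            limit=int(lowered[f"x-ratelimit-limit-{kind}"]),
            remaining=int(lowered[f"x-ratelimit-remaining-{kind}"]),
            reset_seconds=_parse_reset(lowered[f"x-ratelimit-reset-{kind}"]),
        )
    return result
```

A caller could, for example, pause briefly whenever `remaining` for either ceiling drops near zero instead of waiting for a 429.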

Handling a 429

On throttle, Tomoul returns 429 Too Many Requests with Retry-After: <seconds>. Sleep, retry. We will never return 503 for capacity — only 429.

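import time
from openai import RateLimitError

# `client` is an OpenAI SDK client already configured for Tomoul.
for attempt in range(5):
    try:
        resp = client.chat.completions.create(...)
        break
    except RateLimitError as e:
        # Honor the server's Retry-After; fall back to 1 second if absent.
        wait = float(e.response.headers.get("Retry-After", "1"))
        time.sleep(wait)
else:
    # All five attempts were throttled; fail loudly instead of leaving resp unset.
    raise RuntimeError("still rate limited after 5 attempts")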

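The official OpenAI SDK already retries 429s with exponential backoff for you, so an explicit loop like the one above is only needed if you have turned SDK retries off (max_retries=0) or want to control the wait yourself. If you call the API directly over HTTP, apply the same logic to the Retry-After header.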
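For callers that issue HTTP requests themselves, the same Retry-After handling can be sketched without tying it to a particular HTTP library. In the sketch below, `call_with_retry` and its `send` argument are hypothetical: `send` stands for any function you write that performs one request and returns a `(status_code, headers, body)` tuple.

```python
import time


def call_with_retry(send, max_attempts=5, sleep=time.sleep):
    """Call send() until it returns a non-429 status or attempts run out.

    send is a hypothetical callable returning (status_code, headers, body).
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt == max_attempts - 1:
            break  # no point sleeping after the final attempt
        # Header names are case-insensitive in HTTP; normalise before lookup.
        lowered = {k.lower(): v for k, v in headers.items()}
        # Fall back to 1 second if Retry-After is missing.
        sleep(float(lowered.get("retry-after", "1")))
    raise RuntimeError(f"still throttled after {max_attempts} attempts")
```

Passing `sleep` as a parameter keeps the function easy to test, since a test can record the requested waits instead of actually pausing.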

Default ceilings

Plan                     Chat (RPM / TPM)    Embeddings (RPM / TPM)
Free ($5 trial credit)   60 / 30k            120 / 200k
Pay-as-you-go            600 / 200k          1,200 / 2M
Scale (contact)          Custom              Custom
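To see which ceiling a planned workload would hit first, the defaults can be compared against a projected load. This is a back-of-the-envelope sketch: the `DEFAULTS` mapping copies the numbers from the table above, while the plan keys, the `binding_ceiling` function, and the idea of estimating TPM as requests per minute times average tokens per request are assumptions made for illustration.

```python
# Default ceilings from the table above: (RPM, TPM) per plan and workload.
# The lowercase plan and workload keys are placeholders, not API identifiers.
DEFAULTS = {
    "free": {"chat": (60, 30_000), "embeddings": (120, 200_000)},
    "pay-as-you-go": {"chat": (600, 200_000), "embeddings": (1_200, 2_000_000)},
}


def binding_ceiling(plan, workload, requests_per_min, avg_tokens_per_request):
    """Return which ceiling fills first and the fraction of it the load uses."""
    rpm, tpm = DEFAULTS[plan][workload]
    projected_tpm = requests_per_min * avg_tokens_per_request
    rpm_use = requests_per_min / rpm
    tpm_use = projected_tpm / tpm
    name = "RPM" if rpm_use >= tpm_use else "TPM"
    # A fraction above 1.0 means the workload would be throttled.
    return name, max(rpm_use, tpm_use)
```

For instance, 100 chat requests per minute at about 3,000 tokens each on Pay-as-you-go projects to 300k TPM, which is 1.5 times the 200k ceiling even though RPM is well under its limit.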

Raising your limit

Contact us with the model, the region, and your projected sustained TPM. Most reasonable lifts are turned around inside a business day.

Last updated 13 May 2026