tomoul serve

Run any model from Tomoul's catalog locally. One binary, one command. Chat, embeddings, and audio transcription served from the same OpenAI-compatible endpoint.

Usage

$ tomoul serve inkubalm-0.4b
  Downloading inkubalm-0.4b (240 MB Q8_K)...
  Loaded. Listening on http://127.0.0.1:8080 (OpenAI-compatible)

Then point any OpenAI client at http://127.0.0.1:8080/v1. If the model isn't cached, serve pulls it first.

Multimodal in one process

Unlike Ollama (chat-only), one tomoul serve exposes every modality the model class supports:

EndpointWhen it's live
/v1/chat/completionsLLMs (Llama / Phi / Qwen / InkubaLM / gpt-oss).
/v1/completionsSame models, classic completion shape.
/v1/embeddingsEmbedding models (bge-m3, mxbai-embed, sentence-transformer).
/v1/audio/transcriptionsWhisper variants.
/v1/modelsLists what this process is serving.
/healthLiveness probe.

You can run multiple serve processes on different ports (one per modality) or, in Phase 2, a single process holds multiple models and routes by the model field in the request.

Model aliases

Short names resolve to full slugs:

tomoul serve llama3     # → meta-llama/Llama-3.2-3B
tomoul serve bge-m3     # → baai/bge-m3
tomoul serve whisper    # → openai/whisper-large-v3

Aliases are curated and baked into the binary. Full slugs (provider/model) always work.

Defaults

  • Port 8080.
  • One model per serve invocation (multi-model serving is Phase 2).
  • Q8_K on CPU and Apple Silicon, Q4_0 on NVIDIA/AMD GPU (override with --quant).
  • Logs to stdout; telemetry off by default (opt in with --telemetry).

Flags

FlagDefaultNotes
--port8080Bind port.
--host127.0.0.1Bind host. Use 0.0.0.0 for LAN access.
--quantautofp16, int8, int4, q8_k, q4_0.
--deviceautocpu, cuda, metal, vulkan.
--cache~/.cache/tomoulModel weight cache dir.
--cloudoffRoute through api.tomoul.ai instead of local GPU.
--telemetryoffAnonymous usage pings — off by default.

GPU & quantization

On NVIDIA / AMD GPUs we default to Q4_0 via Vulkan compute shaders. On Apple Silicon we use Metal + Q8_K. On CPU we use SIMD (AVX2 / NEON / WASM-SIMD) with the bundled pure-Zig zblas fallback. Run tomoul doctor to see what your machine supports.

Last updated 13 May 2026Edit this page on GitHub