tomoul serve

Run any model from Tomoul's catalog locally. One binary, one command. Chat, embeddings, and audio transcription served from the same OpenAI-compatible endpoint.

Usage

$ tomoul serve inkubalm-0.4b
⬇  Downloading inkubalm-0.4b (240 MB Q8_K)...
✓  Loaded. Listening on http://127.0.0.1:8080 (OpenAI-compatible)

Then point any OpenAI client at http://127.0.0.1:8080/v1. If the model isn't cached, serve pulls it first.

Multimodal in one process

Unlike Ollama (chat-only), one tomoul serve exposes every modality the model class supports:

Endpoint	When it's live
`/v1/chat/completions`	LLMs (Llama / Phi / Qwen / InkubaLM / gpt-oss).
`/v1/completions`	Same models, classic completion shape.
`/v1/embeddings`	Embedding models (`bge-m3`, `mxbai-embed`, sentence-transformer).
`/v1/audio/transcriptions`	Whisper variants.
`/v1/models`	Lists what this process is serving.
`/health`	Liveness probe.

You can run multiple serve processes on different ports (one per modality) or, in Phase 2, a single process holds multiple models and routes by the model field in the request.

Model aliases

Short names resolve to full slugs:

tomoul serve llama3     # → meta-llama/Llama-3.2-3B
tomoul serve bge-m3     # → baai/bge-m3
tomoul serve whisper    # → openai/whisper-large-v3

Aliases are curated and baked into the binary. Full slugs (provider/model) always work.

Defaults

Port 8080.
One model per serve invocation (multi-model serving is Phase 2).
Q8_K on CPU and Apple Silicon, Q4_0 on NVIDIA/AMD GPU (override with --quant).
Logs to stdout; telemetry off by default (opt in with --telemetry).

Flags

Flag	Default	Notes
`--port`	`8080`	Bind port.
`--host`	`127.0.0.1`	Bind host. Use `0.0.0.0` for LAN access.
`--quant`	auto	`fp16`, `int8`, `int4`, `q8_k`, `q4_0`.
`--device`	auto	`cpu`, `cuda`, `metal`, `vulkan`.
`--cache`	`~/.cache/tomoul`	Model weight cache dir.
`--cloud`	off	Route through `api.tomoul.ai` instead of local GPU.
`--telemetry`	off	Anonymous usage pings — off by default.

GPU & quantization

On NVIDIA / AMD GPUs we default to Q4_0 via Vulkan compute shaders. On Apple Silicon we use Metal + Q8_K. On CPU we use SIMD (AVX2 / NEON / WASM-SIMD) with the bundled pure-Zig zblas fallback. Run tomoul doctor to see what your machine supports.

← Previous

tomoul run

tomoul pull

Last updated 13 May 2026Edit this page on GitHub