Tomoul Engine
A minimalist Zig inference engine. No Python runtime, no ONNX, no containers — just weights on disk and a binary that runs. Powers `api.tomoul.ai` and `tomoul serve` internally; importable into any Zig project. MIT-licensed.
What we actually focus on
LLMs are not the engine's first priority. The focus is the models that sit around LLMs and do the unglamorous work in production pipelines: detecting when someone is actually talking, fixing punctuation on raw transcripts, generating embeddings, transcribing speech. These models tend to be the hardest part to deploy, and they don't get a Zig-native runtime anywhere else.
LLMs are in scope — InkubaLM and Qwen3.5 ship today, more on the roadmap — but the differentiator is the surrounding stack.
Three layers
| Layer | Location | Contents |
|---|---|---|
| Layer 3: Stack | `src/models/` | VAD, punctuation, Whisper, embeddings, LLMs |
| Layer 2: Bridge | `tools/` | Python export scripts: PyTorch → `.tl` |
| Layer 1: Core | `src/core/`, `src/gpu/` | Tensor, ops, GPU HAL, loader, quantization |
The `.tl` format is a tiny binary container: magic bytes, dimensions, raw
floats. Convert once with the Python bridge in `tools/`; from that point
forward everything is pure Zig.
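To make "magic bytes, dimensions, raw floats" concrete, here is a minimal sketch of what writing and reading such a container could look like on the Python side. The byte layout, the `MAGIC` value, and the function names `pack_tensor` / `unpack_tensor` are all assumptions made for illustration; the real `.tl` layout is defined by the export scripts in `tools/` and may differ.

```python
import struct

# Hypothetical layout, NOT the actual .tl specification:
#   4 bytes   magic               (placeholder value below)
#   u32 LE    ndim
#   u32 LE    dims[ndim]
#   f32 LE    data[prod(dims)]    row-major
MAGIC = b"TLv0"  # placeholder; the real magic bytes are not given in this document


def _element_count(dims):
    count = 1
    for d in dims:
        count *= d
    return count


def pack_tensor(dims, values):
    """Serialize one tensor as magic + dimensions + raw little-endian floats."""
    count = _element_count(dims)
    if count != len(values):
        raise ValueError(f"expected {count} values, got {len(values)}")
    header = MAGIC + struct.pack(f"<I{len(dims)}I", len(dims), *dims)
    return header + struct.pack(f"<{count}f", *values)


def unpack_tensor(blob):
    """Parse a blob produced by pack_tensor back into (dims, values)."""
    if blob[:4] != MAGIC:
        raise ValueError("bad magic bytes")
    (ndim,) = struct.unpack_from("<I", blob, 4)
    dims = struct.unpack_from(f"<{ndim}I", blob, 8)
    offset = 8 + 4 * ndim
    count = _element_count(dims)
    values = struct.unpack_from(f"<{count}f", blob, offset)
    return list(dims), list(values)
```

A container this simple needs no parsing library on the reading side, which is presumably the appeal for a pure-Zig loader: check the magic, read the dimensions, then treat the remainder as a float buffer.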
What the engine provides
Core (src/core/)
- Tensor + ops with SIMD (AVX2 / NEON / WASM-SIMD).
- BLAS via OpenBLAS / Apple Accelerate, or the bundled pure-Zig `zblas` fallback. WASM builds use `zblas` automatically.
- Quantization: F32, F16, Q8_0, Q4_0, Q8_K with per-block scales.
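As an illustration of what "per-block scales" means for a format like Q8_0, the sketch below follows the convention used by GGML-style Q8_0: values are grouped into blocks of 32, and each block stores one scale plus 32 signed 8-bit integers. That the engine uses exactly this block size and rounding is an assumption, and the names `quantize_q8_0` / `dequantize_q8_0` are hypothetical. The scale is kept as a Python float here, whereas GGML stores it as F16.

```python
BLOCK = 32  # block size of GGML-style Q8_0; assumed to match the engine


def quantize_q8_0(values):
    """Return a list of (scale, int8_values) pairs, one pair per block."""
    if len(values) % BLOCK != 0:
        raise ValueError(f"length must be a multiple of {BLOCK}")
    blocks = []
    for start in range(0, len(values), BLOCK):
        chunk = values[start:start + BLOCK]
        amax = max(abs(v) for v in chunk)
        scale = amax / 127.0  # largest magnitude in the block maps to +/-127
        if scale == 0.0:
            quants = [0] * BLOCK  # all-zero block: avoid dividing by zero
        else:
            # Round to nearest and clamp to guard against float rounding.
            quants = [max(-127, min(127, round(v / scale))) for v in chunk]
        blocks.append((scale, quants))
    return blocks


def dequantize_q8_0(blocks):
    """Reconstruct approximate floats: value = scale * quantized integer."""
    out = []
    for scale, quants in blocks:
        out.extend(scale * q for q in quants)
    return out
```

Because each block carries its own scale, one outlier only degrades the precision of the 32 values that share its block, not the whole tensor. The reconstruction error per value is at most half a scale step.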
GPU HAL (src/gpu/)
- Vulkan, Metal, WebGPU, and CPU fallback — one interface, four backends. Builds for desktop, mobile, and browser.
Architecture families (src/arch/)
Six families, each parameterised by config: llama, deltanet,
transformer_encoder, transformer_decoder, encoder_decoder, lstm.
Full details: Architectures.
Format readers (src/format/)
safetensors, HF `config.json`, and the in-house `.tl` format. GGUF is on the roadmap.
Model registry (src/models/)
| Model | Family | Notes |
|---|---|---|
| Silero VAD | lstm | ~2.2 MB WASM. Voice activity detection. |
| XLM-RoBERTa Punctuation | transformer_encoder | Restores punctuation on raw transcripts. |
| sentence-transformer / bge-m3 | transformer_encoder | Multilingual embeddings. |
| Whisper (tiny / distil-small / large-v3-turbo) | encoder_decoder | Speech-to-text. |
| InkubaLM | llama | African-language LLM. Tomoul exclusive. |
| Qwen3.5 | llama | General-purpose LLM. |
When the engine adds a new architecture family or model, both
tomoul-cloud and tomoul-cli inherit it. That's the
compounding bet.
Who it's for
- Zig developers building inference into a non-AI app. A Zig CLI that wants embeddings, VAD, or transcription — take the engine + a catalog model.
- Researchers prototyping new architectures. Tensor / ops / GPU primitives are free; you write the forward pass and a validation harness.
- Other inference-engine builders. The modules are sharp enough that a fork is cheaper than greenfield.
- Educators. The source is small (~30k LOC of Zig) and modular enough to be a teaching artifact.
If you're calling the Tomoul cloud API from Python, Node, Go, or any
other language, you do not need the engine. Use any OpenAI SDK
pointed at https://api.tomoul.ai/v1. The engine is for Zig consumers
and contributors.
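For a sense of what "pointed at `https://api.tomoul.ai/v1`" amounts to on the wire, here is a minimal standard-library sketch that builds an OpenAI-style request against that base URL. The `/embeddings` route and the bearer-token header follow the OpenAI API convention; whether the Tomoul API exposes this route, and which model IDs it accepts, are assumptions. The helper names and the model name in the usage note are placeholders.

```python
import json
import urllib.request

BASE_URL = "https://api.tomoul.ai/v1"  # base URL as given in the text above


def build_embeddings_request(api_key, model, text):
    """Build (but do not send) an OpenAI-style embeddings request.

    The /embeddings route mirrors the OpenAI convention; its availability
    on this API is an assumption.
    """
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/embeddings",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request):
    """Send the request and decode the JSON response (needs network access)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Usage would look like `send(build_embeddings_request(key, "example-embedding-model", "hello"))`, where the model name is a placeholder. With the official OpenAI Python SDK, the equivalent is to pass `base_url="https://api.tomoul.ai/v1"` when constructing the client.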
What it's not
- Not a per-language SDK matrix. No npm / PyPI / crates.io packages.
Non-Zig consumers integrate via HuggingFace release artifacts
(`.wasm`, `.a`, `.so`, `.dylib`, `.h`, `.tl`) directly.
- Not a training framework. Inference only.
- Not a generic ML toolkit. Curated architectures, not every PyTorch op.
- Not stable yet. Pre-1.0. Pin a commit. See Stability & versioning.
Source & license
- Repo: github.com/tomoul/tomoul
- License: MIT
- Release artifacts: huggingface.co/tomoul