GPU & quantization

One GPU HAL covers Vulkan, Metal, and WebGPU, with a CPU fallback. Quantization is applied per tensor or per block and is chosen at load time.

Backend selection

var model = try tomoul.models.bge_m3.load(a, .{ .device = .auto });
// .device accepts .auto, .cpu, .vulkan, .metal, or .webgpu

Backends live in src/gpu/: vulkan.zig, metal.zig, webgpu.zig, plus the shared hal.zig interface every backend implements.
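To illustrate what "a shared interface every backend implements" can look like in Zig, here is a minimal sketch using a context pointer plus a table of function pointers, the same pattern std.mem.Allocator uses. The names Hal, VTable, CpuBackend, and the single scale operation are hypothetical and are not taken from hal.zig; the real interface is likely to be larger and shaped differently.

```zig
const std = @import("std");

/// Hypothetical backend interface: an opaque context plus a function table.
/// Illustrative only; not the contents of src/gpu/hal.zig.
pub const Hal = struct {
    ctx: *anyopaque,
    vtable: *const VTable,

    pub const VTable = struct {
        name: *const fn (ctx: *anyopaque) []const u8,
        /// out[i] = x[i] * s, standing in for a real compute kernel.
        scale: *const fn (ctx: *anyopaque, x: []const f32, s: f32, out: []f32) void,
    };

    pub fn name(self: Hal) []const u8 {
        return self.vtable.name(self.ctx);
    }

    pub fn scale(self: Hal, x: []const f32, s: f32, out: []f32) void {
        self.vtable.scale(self.ctx, x, s, out);
    }
};

/// Hypothetical CPU implementation of the interface above.
pub const CpuBackend = struct {
    calls: usize = 0,

    pub fn hal(self: *CpuBackend) Hal {
        return .{ .ctx = self, .vtable = &vtable };
    }

    const vtable = Hal.VTable{ .name = nameImpl, .scale = scaleImpl };

    fn nameImpl(_: *anyopaque) []const u8 {
        return "cpu";
    }

    fn scaleImpl(ctx: *anyopaque, x: []const f32, s: f32, out: []f32) void {
        // Recover the concrete backend from the opaque context pointer.
        const self: *CpuBackend = @ptrCast(@alignCast(ctx));
        self.calls += 1;
        for (x, out) |v, *o| o.* = v * s;
    }
};
```

Callers hold only a Hal value, so code written against it does not need to know which backend is behind it.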

Platform	Default backend
Linux NVIDIA / AMD	Vulkan
macOS (Apple Silicon)	Metal
macOS (Intel)	Metal (limited) or CPU
Windows	Vulkan
Browser / WASM	WebGPU (falls back to CPU)
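The defaults in the table can be expressed as a small lookup. The sketch below is one way `.auto` could be resolved from the compilation target; the Device enum and both function names are assumptions made for illustration, and the runtime fallbacks the table mentions (Intel macOS to CPU, WebGPU to CPU) are not modeled.

```zig
const std = @import("std");
const builtin = @import("builtin");

/// Hypothetical mirror of the `.device` option; not the library's actual type.
pub const Device = enum { auto, cpu, vulkan, metal, webgpu };

/// Map a target OS and CPU architecture to the default backend from the table.
pub fn defaultBackend(os: std.Target.Os.Tag, arch: std.Target.Cpu.Arch) Device {
    // Browser / WASM builds default to WebGPU regardless of the OS tag.
    if (arch == .wasm32 or arch == .wasm64) return .webgpu;
    return switch (os) {
        .linux, .windows => .vulkan,
        .macos => .metal,
        else => .cpu,
    };
}

/// Resolve `.auto` for the current compilation target.
/// Explicit choices are passed through unchanged.
pub fn resolve(requested: Device) Device {
    return if (requested == .auto)
        defaultBackend(builtin.os.tag, builtin.cpu.arch)
    else
        requested;
}
```

A real implementation would also probe at runtime whether the chosen backend can be initialized before committing to it.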

No separate CUDA backend.

NVIDIA hardware is targeted through Vulkan compute shaders — one codepath, no NVIDIA-only tax. ROCm isn't required for AMD either.

CPU path

On the CPU backend, matrix math goes through BLAS. Two options at build time:

OpenBLAS or Apple Accelerate — link the system library. Pass -Dblas=true.
zblas — the bundled pure-Zig fallback. No system dependency. WASM builds use it automatically.

The fallback isn't a stub — zblas is fast enough that small models (Silero VAD, sentence-transformer, bge-m3 at smaller dims) run well on CPU without any system BLAS.
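For reference, the core operation a BLAS library provides here is general matrix multiplication (GEMM). The function below is a deliberately naive row-major version that shows what is being computed; it is not zblas's implementation, and the name matmul and its signature are assumptions for illustration. An optimized library computes the same result with blocking and SIMD.

```zig
const std = @import("std");

/// Reference row-major GEMM: c = a * b,
/// with a of shape (m x k), b of shape (k x n), and c of shape (m x n).
pub fn matmul(a: []const f32, b: []const f32, c: []f32, m: usize, k: usize, n: usize) void {
    std.debug.assert(a.len == m * k and b.len == k * n and c.len == m * n);
    for (0..m) |i| {
        for (0..n) |j| {
            // Dot product of row i of a with column j of b.
            var acc: f32 = 0;
            for (0..k) |p| acc += a[i * k + p] * b[p * n + j];
            c[i * n + j] = acc;
        }
    }
}
```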

Quantization formats

Format	Bits	Use case
`f32`	32	Reference / validation only.
`f16`	16	Default for embeddings + small models.
`q8_0`	8	Solid quality / size tradeoff.
`q4_0`	4	Compact; suited to mid-size and large LLMs.
`q8_k`	8 (per-block)	Higher fidelity than q8_0 at near-equal size.
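A block-quantized 8-bit format is commonly implemented by storing one scale per small block of weights plus one signed 8-bit integer per weight. The sketch below shows that idea as a round trip. It assumes a block size of 32 and an f32 scale (ggml-style q8_0 uses 32 weights with an f16 scale); the engine's actual block layout may differ, and BlockQ8, quantize, and dequantize are hypothetical names.

```zig
const std = @import("std");

/// Assumed block size; not confirmed for this engine.
pub const block_size = 32;

/// One quantized block: a shared scale plus one i8 per weight.
pub const BlockQ8 = struct {
    scale: f32,
    q: [block_size]i8,
};

/// Quantize one block symmetrically so the largest magnitude maps to +/-127.
pub fn quantize(x: *const [block_size]f32) BlockQ8 {
    var max_abs: f32 = 0;
    for (x) |v| max_abs = @max(max_abs, if (v < 0) -v else v);

    const scale = max_abs / 127.0;
    var out = BlockQ8{ .scale = scale, .q = [_]i8{0} ** block_size };
    if (scale == 0) return out; // all-zero block, nothing to encode

    for (x, &out.q) |v, *q| q.* = @intFromFloat(@round(v / scale));
    return out;
}

/// Reconstruct approximate f32 weights from a quantized block.
pub fn dequantize(b: BlockQ8) [block_size]f32 {
    var out: [block_size]f32 = undefined;
    for (b.q, &out) |q, *o| o.* = @as(f32, @floatFromInt(q)) * b.scale;
    return out;
}
```

With this scheme the reconstruction error of each weight is at most half of the block's scale, which is why smaller blocks (and therefore tighter scales) tend to preserve more fidelity.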

Pick at load time:

var model = try tomoul.arch.llama.LlamaModel.load(a, .{
    .weights = "phi-4.tl",
    .quant   = .q4_0,
});
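When choosing a format, a rough size estimate follows directly from the Bits column: parameter count times bits per weight, divided by 8. The helper below is a back-of-the-envelope sketch; the Quant enum and approxWeightBytes are hypothetical names, and the estimate ignores per-block scale overhead and any tensors kept at higher precision, so real files will be somewhat larger.

```zig
const std = @import("std");

/// Hypothetical mirror of the formats in the table above.
pub const Quant = enum {
    f32,
    f16,
    q8_0,
    q4_0,
    q8_k,

    /// Nominal bits per weight, taken from the Bits column.
    pub fn bitsPerWeight(self: Quant) u64 {
        return switch (self) {
            .f32 => 32,
            .f16 => 16,
            .q8_0, .q8_k => 8,
            .q4_0 => 4,
        };
    }
};

/// Approximate weight storage in bytes, rounded up to a whole byte.
pub fn approxWeightBytes(param_count: u64, quant: Quant) u64 {
    return (param_count * quant.bitsPerWeight() + 7) / 8;
}
```

For example, a model with one billion parameters comes to roughly 2 GB at f16 and roughly 0.5 GB at q4_0 under this estimate.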

Browser / WASM

The engine cross-compiles to wasm32 with the WebGPU backend. Smaller embedding models (bge-m3, mxbai-embed) run in the browser. LLMs are currently too large to run in a WASM build; use the cloud API for those instead.


Last updated 13 May 2026