GPU & quantization

One GPU hardware abstraction layer (HAL) sits over Vulkan, Metal, and WebGPU, with a CPU fallback. Quantization is applied per tensor or per block, and the format is chosen at load time.

Backend selection

var model = try tomoul.models.bge_m3.load(a, .{ .device = .auto });
// .auto, .cpu, .vulkan, .metal, .webgpu

Backends live in src/gpu/: vulkan.zig, metal.zig, webgpu.zig, plus the shared hal.zig interface every backend implements.

Platform                 Default backend
Linux NVIDIA / AMD       Vulkan
macOS (Apple Silicon)    Metal
macOS (Intel)            Metal (limited) or CPU
Windows                  Vulkan
Browser / WASM           WebGPU (falls back to CPU)

No separate CUDA backend.

NVIDIA hardware is targeted through Vulkan compute shaders, so NVIDIA and AMD GPUs share one codepath and there is no vendor-specific backend to maintain. For the same reason, ROCm isn't required for AMD.
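
The platform table above can be read as a small decision function. The following is a minimal sketch of how `.auto` might resolve, not tomoul's actual implementation: the `Device` enum only mirrors the options listed earlier, and `defaultBackend` is a hypothetical helper. It ignores runtime fallbacks such as WebGPU being unavailable in a given browser, or limited Metal support on Intel Macs.

```zig
const std = @import("std");
const builtin = @import("builtin");

// Hypothetical mirror of the device options shown above; not tomoul's actual type.
const Device = enum { auto, cpu, vulkan, metal, webgpu };

// Hypothetical helper: map a compile target to the default backend from the table.
// Runtime fallback to CPU is not modeled here.
fn defaultBackend(os: std.Target.Os.Tag, arch: std.Target.Cpu.Arch) Device {
    // WASM targets use WebGPU regardless of the OS tag.
    if (arch == .wasm32 or arch == .wasm64) return .webgpu;
    return switch (os) {
        .macos => .metal,
        .linux, .windows => .vulkan,
        else => .cpu,
    };
}

pub fn main() void {
    const picked = defaultBackend(builtin.os.tag, builtin.cpu.arch);
    std.debug.print("default backend: {s}\n", .{@tagName(picked)});
}
```

Passing an explicit device such as `.cpu` or `.vulkan` at load time would bypass this kind of resolution entirely.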

CPU path

On the CPU backend, matrix math goes through BLAS. Two options at build time:

  • OpenBLAS or Apple Accelerate — link the system library. Pass -Dblas=true.
  • zblas — the bundled pure-Zig fallback. No system dependency. WASM builds use it automatically.

The fallback isn't a stub — zblas is fast enough that small models (Silero VAD, sentence-transformer, bge-m3 at smaller dims) run well on CPU without any system BLAS.

Quantization formats

Format   Bits            Use case
f32      32              Reference / validation only.
f16      16              Default for embeddings + small models.
q8_0     8               Solid quality / size tradeoff.
q4_0     4               Compact, mid-large LLMs.
q8_k     8 (per-block)   Higher fidelity than q8_0 at near-equal size.
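
To make the block-wise idea concrete, here is a minimal sketch of symmetric 8-bit block quantization. It assumes a block of 32 values sharing one f32 scale; that block length, the `Block8` layout, and both function names are illustrative assumptions, and the actual layouts behind the formats above may differ.

```zig
const std = @import("std");

// Assumed block length; the real formats may use a different one.
const block_len = 32;

// Hypothetical layout: one f32 scale shared by `block_len` signed 8-bit values.
const Block8 = struct {
    scale: f32,
    values: [block_len]i8,
};

// Symmetric quantization: pick the scale so the largest magnitude maps to 127.
fn quantizeBlock(input: [block_len]f32) Block8 {
    var max_abs: f32 = 0;
    for (input) |x| max_abs = @max(max_abs, @abs(x));

    // An all-zero block keeps scale 1 to avoid dividing by zero.
    const scale: f32 = if (max_abs == 0) 1 else max_abs / 127.0;
    var out = Block8{ .scale = scale, .values = undefined };
    for (input, 0..) |x, i| {
        out.values[i] = @intFromFloat(@round(x / scale));
    }
    return out;
}

// Reverse step: multiply each stored integer by the block's scale.
fn dequantizeBlock(block: Block8) [block_len]f32 {
    var out: [block_len]f32 = undefined;
    for (block.values, 0..) |q, i| {
        out[i] = @as(f32, @floatFromInt(q)) * block.scale;
    }
    return out;
}
```

Under these assumptions each block stores 36 bytes for 32 weights, and the round-trip error of any value is at most half a quantization step (`scale / 2`). Because the scale is chosen per block, one outlier only coarsens its own block rather than the whole tensor.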

Pick at load time:

var model = try tomoul.arch.llama.LlamaModel.load(a, .{
    .weights = "phi-4.tl",
    .quant   = .q4_0,
});
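
When choosing a format, a rough size estimate helps. The sketch below multiplies a parameter count by the bits per weight from the format table. The `Format` enum and both functions are hypothetical, the parameter count is a made-up round number, and per-block scale overhead is ignored, so real files will be somewhat larger.

```zig
const std = @import("std");

// Hypothetical mirror of the format table; not tomoul's actual type.
const Format = enum { f32, f16, q8_0, q4_0, q8_k };

// Nominal bits per weight, taken from the table (block scales not counted).
fn bitsPerWeight(format: Format) u64 {
    return switch (format) {
        .f32 => 32,
        .f16 => 16,
        .q8_0, .q8_k => 8,
        .q4_0 => 4,
    };
}

// Approximate weight storage in bytes for a given parameter count.
fn weightBytes(param_count: u64, format: Format) u64 {
    return param_count * bitsPerWeight(format) / 8;
}

pub fn main() void {
    const params: u64 = 1_000_000_000; // hypothetical model size
    const formats = [_]Format{ .f32, .f16, .q8_0, .q4_0, .q8_k };
    for (formats) |format| {
        std.debug.print("{s}: {d} bytes\n", .{ @tagName(format), weightBytes(params, format) });
    }
}
```

For the hypothetical one-billion-parameter model, this gives 4 GB of weights at f32 and 0.5 GB at q4_0, before any per-block overhead.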

Browser / WASM

The engine cross-compiles to wasm32 with the WebGPU backend. Smaller embedding models (bge-m3, mxbai-embed) run in the browser. LLMs are size-prohibitive in WASM today — use the cloud API instead.

Last updated 13 May 2026