RAG with bge-m3

End-to-end retrieval pipeline using Tomoul's embeddings, reranker, and a small chat model. ~50 lines, ~$0 in credits.

Setup

pip install openai numpy
export TOMOUL_KEY=tomoul_sk_...

Indexing

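import os

from openai import OpenAI

# Read the key exported in Setup; Python does not expand "$TOMOUL_KEY" inside a string.
client = OpenAI(api_key=os.environ["TOMOUL_KEY"], base_url="https://api.tomoul.ai/v1")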
 
DOCS = [
    "bge-m3 is BAAI's multilingual embedding model.",
    "Tomoul hosts bge-m3 in EU regions today.",
    "Helsinki is the capital of Finland.",
]
vecs = client.embeddings.create(model="baai/bge-m3", input=DOCS).data
vectors = [v.embedding for v in vecs]

Store vectors alongside DOCS in your vector DB of choice: pgvector, Qdrant, and LanceDB all work.
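
While prototyping, the same pairing of texts and vectors can live on disk before a database is chosen. The following is a minimal sketch, assuming NumPy's .npz format is an acceptable stand-in for a vector DB at small scale; save_index and load_index are hypothetical helper names, not part of any Tomoul SDK.

```python
import numpy as np

def save_index(path, docs, vectors):
    """Write documents and their embeddings to one .npz file (hypothetical helper)."""
    np.savez(
        path,
        docs=np.array(docs),                          # unicode array, no pickling needed
        vectors=np.asarray(vectors, dtype=np.float32),
    )

def load_index(path):
    """Return (docs, vectors): a list of strings and a float32 matrix, row i matching docs[i]."""
    with np.load(path) as data:
        return data["docs"].tolist(), data["vectors"]
```

Keeping row i of the matrix aligned with docs[i] is the only invariant that matters here; a real vector DB enforces the same pairing through a row or point ID.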

Querying

import numpy as np
 
def search(query, k=5):
    qv = client.embeddings.create(
        model="baai/bge-m3",
        input=query,
    ).data[0].embedding
    scores = np.dot(vectors, qv)
    idx = np.argsort(-scores)[:k]
    return [DOCS[i] for i in idx]
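
The search function above ranks by raw dot product, which matches cosine similarity only when every vector has unit length. Whether the endpoint returns normalised vectors is not stated here, so the sketch below normalises explicitly. It is an illustrative variant, not the documented behaviour, and top_k_cosine is a hypothetical name.

```python
import numpy as np

def top_k_cosine(doc_vectors, query_vector, k=5):
    """Return (indices, scores) of the k documents closest to the query by cosine similarity."""
    m = np.asarray(doc_vectors, dtype=np.float32)
    q = np.asarray(query_vector, dtype=np.float32)
    # Scale rows and query to unit length so the dot product equals cosine similarity.
    m = m / np.linalg.norm(m, axis=1, keepdims=True)
    q = q / np.linalg.norm(q)
    scores = m @ q
    idx = np.argsort(-scores)[:k]
    return idx.tolist(), scores[idx].tolist()
```

With unnormalised inputs the two rankings can differ: a long vector pointing only roughly in the query's direction can outscore a short vector that points exactly along it.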

Rerank step

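import os

import requests

def rerank(query, candidates, top_n=3):
    resp = requests.post(
        "https://api.tomoul.ai/v1/rerank",
        # Read the key from the environment; "$TOMOUL_KEY" is not expanded inside a Python string.
        headers={"Authorization": f"Bearer {os.environ['TOMOUL_KEY']}"},
        json={
            "model": "baai/bge-reranker-v2-m3",
            "query": query,
            "documents": candidates,
            "top_n": top_n,
        },
        timeout=30,
    )
    resp.raise_for_status()  # surface HTTP errors instead of a KeyError on "results"
    r = resp.json()
    # Each hit carries the position of the document in the candidates list.
    return [candidates[hit["index"]] for hit in r["results"]]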

The rerank step routinely lifts scores on retrieval benchmarks by 15–30 points, which makes it the cheapest quality win in the stack relative to the effort involved.
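
How much reranking helps depends on the corpus, so it is worth measuring on your own queries. The sketch below computes recall@k for one query; it assumes you have a small hand-labelled set of relevant documents per query, and recall_at_k is a hypothetical helper rather than anything Tomoul provides.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents that appear in the first k ranked results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0  # nothing to find, so report zero rather than divide by zero
    found = set(ranked[:k]) & relevant
    return len(found) / len(relevant)
```

Run it once on the output of search and once on the output of rerank for the same queries, then compare the averages at the k you pass to the chat model.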

Generate the answer

def answer(query):
    hits = rerank(query, search(query, k=10), top_n=3)
    prompt = f"Context:\n{chr(10).join(hits)}\n\nQuestion: {query}"
    resp = client.chat.completions.create(
        model="microsoft/phi-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

Production tips

Batch embed calls: up to 2048 inputs per request. See Embeddings batching.
Pin a region with X-Tomoul-Region to keep latency predictable.
Cache user-query embeddings in your app: the same query always yields the same vector, so there is no need to re-embed it.
For multilingual corpora (for example, mixed Swahili and English), bge-m3 beats most English-only models at retrieval recall, so don't fragment your index across models. See Multilingual retrieval.
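
The batching and caching tips can be sketched as below. This is one possible shape, not a prescribed one: embed_batch is a hypothetical callable that takes a list of strings and returns one vector per string, and the 2048 limit is the figure quoted in the tips above.

```python
from functools import lru_cache

MAX_BATCH = 2048  # per-request input limit quoted in the tips above

def chunked(items, size=MAX_BATCH):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embed_all(texts, embed_batch):
    """Embed any number of texts, one request per chunk, preserving input order."""
    vectors = []
    for batch in chunked(texts):
        vectors.extend(embed_batch(batch))
    return vectors

def make_cached_query_embedder(embed_batch, maxsize=10_000):
    """Wrap embed_batch so that repeated identical queries are served from memory."""
    @lru_cache(maxsize=maxsize)
    def embed_query(query):
        # Return a tuple so the cached value cannot be mutated by callers.
        return tuple(embed_batch([query])[0])
    return embed_query
```

With the client from the Indexing section, embed_batch could be a small function that calls client.embeddings.create(model="baai/bge-m3", input=batch) and returns [d.embedding for d in response.data]. The cache here is per-process; a shared store would be needed to reuse vectors across workers.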

Last updated 13 May 2026