Multilingual retrieval

A practical recipe for retrieval across languages — including Swahili, Yoruba, Amharic, and other African languages that monolingual embedders miss.

The problem

English-centric embedders (text-embedding-3-large, mxbai-embed-large) tend to cluster Swahili or Yoruba text into near-identical regions of the vector space, so unrelated documents look alike. Recall on non-English queries can collapse below 30% on real corpora.

The fix isn't to translate everything into English first; that adds a translation step to every document and every query. The fix is to embed everything with a model that was trained multilingually.

Pick the right embedder

baai/bge-m3: the default for >95% of multilingual retrieval. 100+ languages, 1024 dims, 8192-token context.
jinaai/jina-embeddings-v3: a multilingual text alternative with task-specific adapters. It is text-only, so image + text retrieval needs a multimodal model instead.
Tomoul partner embedders (Lelapa, Masakhane): for specialist African-language retrieval. Once they are listed, they appear in GET /v1/models with the tomoul/ prefix.
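
To check whether a partner embedder has been listed, one option is to filter the model list by prefix. The sketch below assumes the GET /v1/models response uses the common OpenAI-style shape, `{"data": [{"id": ...}]}`, which this page does not confirm. `partner_models` is a hypothetical helper and the example IDs are made up.

```python
def partner_models(models_payload, prefix="tomoul/"):
    """Return the model IDs in a /v1/models payload that start with `prefix`.

    Assumes an OpenAI-style payload: {"data": [{"id": "..."}, ...]}.
    """
    return sorted(
        m["id"]
        for m in models_payload.get("data", [])
        if m.get("id", "").startswith(prefix)
    )

# Hypothetical payload for illustration only; the partner ID is invented.
example = {"data": [{"id": "baai/bge-m3"}, {"id": "tomoul/example-embedder"}]}
print(partner_models(example))
```

An empty list simply means no partner embedder is listed yet, in which case bge-m3 remains the fallback.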

Indexing a mixed corpus

Embed every doc with the same multilingual model. Don't embed English docs with one model and Swahili docs with another — your vectors won't share a space and similarity is meaningless across them.

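# `client` is assumed to be an already-configured OpenAI-compatible SDK client.
texts = ["Good morning.", "Habari za asubuhi.", "Bonjour."]
resp = client.embeddings.create(model="baai/bge-m3", input=texts)
vectors = [item.embedding for item in resp.data]  # one vector per input, in order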

The three vectors above should land close together, within roughly 0.15 cosine distance of each other, because the sentences are near-equivalent greetings.
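
To verify that claim on your own data, compute the pairwise cosine distances directly. This is a minimal sketch using only the standard library; the helper names are illustrative, and the commented usage assumes an OpenAI-style response whose items expose `.embedding`.

```python
import math

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity, for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def pairwise_distances(vectors):
    """Return {(i, j): distance} for every pair of indices with i < j."""
    return {
        (i, j): cosine_distance(vectors[i], vectors[j])
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    }

# Usage with the embedding response from the snippet above (assumed shape):
# vectors = [item.embedding for item in resp.data]
# print(pairwise_distances(vectors))
```

If distances between translations are much larger than expected, that is a signal to test the language more carefully before indexing a full corpus.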

Cross-lingual queries

A query in English will retrieve relevant Swahili docs and vice versa — no translation step needed.

search("morning greetings")          # finds "Habari za asubuhi."
search("salutations du matin")       # finds the same
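
The `search` call above is not defined on this page. One way it could work is a brute-force cosine search over vectors produced by a single embedder. In this sketch, `build_index`, the three-argument `search`, and the `embed` callable (list of strings in, list of vectors out) are all assumptions; a one-argument `search` like the one above would close over `index` and `embed`.

```python
import math

def _normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def build_index(docs, embed):
    """Embed every document with the same `embed` callable and unit-normalize."""
    return [(doc, _normalize(vec)) for doc, vec in zip(docs, embed(docs))]

def search(query, index, embed, k=3):
    """Return the top-k (score, doc) pairs by cosine similarity."""
    q = _normalize(embed([query])[0])
    scored = [(sum(a * b for a, b in zip(q, vec)), doc) for doc, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Hypothetical wiring to the API, assuming an OpenAI-style response:
# embed = lambda texts: [
#     item.embedding
#     for item in client.embeddings.create(model="baai/bge-m3", input=texts).data
# ]
# index = build_index(["Good morning.", "Habari za asubuhi.", "Bonjour."], embed)
# print(search("morning greetings", index, embed))
```

Because the query and the documents go through the same `embed` callable, the language of the query never needs to be detected or translated. A linear scan is fine for small corpora; larger ones would swap in a vector database.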

African-language specifics

Language	Code	bge-m3 quality
Swahili	`sw`	Strong out of the box.
Yoruba	`yo`	Acceptable; specialist embedders recover 5–15 MRR points when listed.
Hausa	`ha`	Acceptable; same caveat.
Igbo	`ig`	Acceptable; same caveat.
Amharic	`am`	Ge'ez script. Test recall before shipping.
Tigrinya	`ti`	Ge'ez script. Test recall before shipping.
Code-switched	—	Sheng / Naija Pidgin work well in practice.
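
"Test recall before shipping" can be made concrete with a small labelled query set per language. The sketch below computes recall@k and mean reciprocal rank (MRR) from ranked result IDs; the function names and the input shape (one `(ranked_ids, relevant_ids)` pair per query) are assumptions for illustration.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def evaluate(runs, k=10):
    """`runs` is a list of (ranked_ids, relevant_ids) pairs, one per query."""
    n = len(runs)
    return {
        f"recall@{k}": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }
```

Running `evaluate` separately per language, rather than over the pooled query set, keeps a strong Swahili score from hiding a weak Amharic or Tigrinya one.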

Don't skip the rerank

On multilingual corpora, the relevance gap between embedding-top-K and cross-encoder rerank is wider than on English-only corpora. See RAG with bge-m3 for the full pipeline.
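
The two-stage shape can be sketched as follows: take the embedding top-K, then re-order it with a pairwise scorer. This is not the full pipeline from the linked guide. `retrieve_then_rerank` and `score_pair` are hypothetical names, and `score_pair(query, doc)` stands in for whatever cross-encoder you call.

```python
def retrieve_then_rerank(query, candidates, score_pair, top_n=5):
    """Re-order first-stage candidates with a pairwise relevance scorer.

    `candidates` is the embedding top-K as a list of document strings.
    `score_pair(query, doc)` is a hypothetical callable wrapping a
    cross-encoder; a higher score means more relevant.
    """
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]
```

Keeping K well above `top_n` gives the reranker room to promote documents that the embedder ranked low.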

Last updated 13 May 2026