Multilingual retrieval

A practical recipe for retrieval across languages — including Swahili, Yoruba, Amharic, and other African languages that monolingual embedders miss.

The problem

English-centric embedders (text-embedding-3-large, mxbai-embed-large) tend to cluster Swahili or Yoruba text into near-identical regions of the vector space, so unrelated documents look alike. Recall on non-English queries can collapse below 30% on real corpora.

The fix isn't to translate every document first, which adds a translation step to indexing and to every non-English query. The fix is to embed everything with a model trained on multilingual data.

Pick the right embedder

  • baai/bge-m3: the default for the vast majority (>95%) of multilingual retrieval workloads. 100+ languages, 1024 dims, 8192-token context.
  • jinaai/jina-embeddings-v3: a multilingual text-only alternative. It does not embed images, so image + text retrieval needs a multimodal embedder instead.
  • Tomoul partner embedders (Lelapa, Masakhane): for specialist African-language retrieval. Once listed, they appear in GET /v1/models with the tomoul/ prefix.
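
One way to check whether a partner embedder has been listed is to filter the model list by prefix. The sketch below assumes GET /v1/models returns an OpenAI-style payload of the form {"data": [{"id": ...}]}; that response shape, the helper name partner_embedders, and the sample model ID are assumptions for illustration, not documented behaviour.

```python
def partner_embedders(models_payload, prefix="tomoul/"):
    """Return model IDs that start with the partner prefix, sorted."""
    # Assumes an OpenAI-style list response: {"data": [{"id": "..."}, ...]}.
    ids = (m.get("id", "") for m in models_payload.get("data", []))
    return sorted(i for i in ids if i.startswith(prefix))


# Hypothetical payload; "tomoul/example-embedder" is a made-up ID.
sample = {"data": [{"id": "baai/bge-m3"}, {"id": "tomoul/example-embedder"}]}
print(partner_embedders(sample))
```

An empty result simply means no partner embedder is listed yet, in which case bge-m3 remains the fallback.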

Indexing a mixed corpus

Embed every doc with the same multilingual model. Don't embed English docs with one model and Swahili docs with another — your vectors won't share a space and similarity is meaningless across them.

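# `client` is assumed to be an OpenAI-compatible client, configured elsewhere.
texts = ["Good morning.", "Habari za asubuhi.", "Bonjour."]
resp = client.embeddings.create(model="baai/bge-m3", input=texts)
vectors = [item.embedding for item in resp.data]  # one vector per input, in order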

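The three vectors above should land close together, within roughly 0.15 cosine distance of each other, because they are near-equivalent greetings in different languages. Treat the figure as a rule of thumb and check it on your own data.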
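
To check that claim on your own inputs, you can compute pairwise cosine distances between the returned vectors. This is a minimal sketch in plain Python; the helper names cosine_distance and pairwise_distances are illustrative, and it assumes every vector is non-zero and of equal length.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def pairwise_distances(labels, vectors):
    """Map each (label_i, label_j) pair to its cosine distance."""
    pairs = combinations(range(len(vectors)), 2)
    return {(labels[i], labels[j]): cosine_distance(vectors[i], vectors[j])
            for i, j in pairs}
```

Pass the input strings as labels and the embeddings taken from the response as vectors. A pair that sits far from the others is a hint that the model handles that language or phrasing poorly.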

Cross-lingual queries

A query in English will retrieve relevant Swahili docs and vice versa — no translation step needed.

search("morning greetings")          # finds "Habari za asubuhi."
search("salutations du matin")       # finds the same
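
The search calls above leave the implementation open. A brute-force version could look like the sketch below, which ranks documents by cosine similarity to the query. The embed argument is a hypothetical callable that turns a list of strings into a list of non-zero vectors (for example a thin wrapper around the embeddings call shown earlier), and unlike the one-argument search above, this version takes the index and embedder explicitly.

```python
import math


def _normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def build_index(docs, embed):
    """Embed every document once with the same model and store unit vectors."""
    return [(doc, _normalize(vec)) for doc, vec in zip(docs, embed(docs))]


def search(query, index, embed, k=3):
    """Return the k documents with the highest cosine similarity to the query."""
    q = _normalize(embed([query])[0])
    # On unit vectors, the dot product equals cosine similarity.
    scored = [(sum(a * b for a, b in zip(q, vec)), doc) for doc, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]
```

A linear scan like this is fine for small corpora and for sanity checks; a vector database plays the same role at scale. The important part is that the query and the documents go through the same multilingual embedder.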

African-language specifics

Language       Code  bge-m3 quality
Swahili        sw    Strong out of the box.
Yoruba         yo    Acceptable; specialist embedders recover 5–15 MRR points when listed.
Hausa          ha    Acceptable; same caveat.
Igbo           ig    Acceptable; same caveat.
Amharic        am    Ge'ez script. Test recall before shipping.
Tigrinya       ti    Ge'ez script. Test recall before shipping.
Code-switched  -     Sheng / Naija Pidgin work well in practice.
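
"Test recall before shipping" can be as simple as running a small labelled query set per language and averaging recall@k and MRR. The sketch below assumes you already have, for each query, the ranked document IDs your retriever returned and the IDs a human marked as relevant; the function names and data layout are illustrative.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(set(relevant_ids))


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k=10):
    """Average recall@k and MRR over (ranked_ids, relevant_ids) pairs."""
    n = len(runs)
    recall = sum(recall_at_k(r, rel, k) for r, rel in runs) / n
    mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / n
    return {"recall@k": recall, "mrr": mrr}
```

Running evaluate separately per language makes gaps visible, for instance if Amharic or Tigrinya queries score well below Swahili ones on the same corpus.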

Don't skip the rerank.

On multilingual corpora, the relevance gap between embedding-top-K and cross-encoder rerank is wider than on English-only corpora. See RAG with bge-m3 for the full pipeline.
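
The second stage can be sketched independently of any particular reranker. Here score_pair is a hypothetical callable that scores a (query, document) pair, as a cross-encoder would; the function name, its signature, and the top_n default are assumptions for illustration rather than part of any documented pipeline.

```python
def retrieve_then_rerank(query, candidates, score_pair, top_n=5):
    """Rerank embedding-retrieved candidates with a cross-encoder style scorer.

    candidates: documents from the embedding top-K step, in any order.
    score_pair: hypothetical callable (query, doc) -> float, higher is better.
    """
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]
```

Retrieve a generous top-K from the embedding index and keep only a small top_n after reranking, since the reranker scores every candidate against the query and is the slower of the two steps.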

Last updated 13 May 2026