Embedding Models 2026: OpenAI vs Cohere vs Voyage vs BGE
By Ergini, Software & AI Developer in Pristina, Kosovo
TL;DR
Voyage-3-large beats OpenAI by ~14% on retrieval - but at what cost? Here is the head-to-head across three production retrieval workloads, including the open-source BGE models you can self-host, and which one I use for what.
Direct verdict: which embedding model to pick in 2026
The embedding model is the single decision in a RAG pipeline that gets the most debate and the least rigor. Three honest defaults cover 95 percent of real projects. OpenAI text-embedding-3-small is the right starting point for almost everything because it costs $0.02 per million tokens, runs at low latency from a managed API, is multilingual enough for European workloads, and is trivial to swap later. Reach for Voyage-3-large when retrieval quality is the bottleneck and your domain is specialized - legal, finance, biomedical, or code. Reach for BGE-M3 when self-hosting wins on data residency, scale economics, or predictability.
Everything else in this post is the detail behind those three sentences. Cohere embed-v4 gets its own moment for multilingual and instruction-following retrieval, and the closing sections cover Matryoshka embeddings and the migration pattern that keeps a model swap from melting your weekend. If you want the broader retrieval picture, this sits naturally next to the vector database comparison and the production RAG architecture guide.
The 2026 contenders at a glance
Eight models are doing the real work in production right now. The table below is the cheat sheet I keep open when scoping embedding work. MTEB scores are the v2 leaderboard averages I last verified in spring 2026 - they move month to month, so treat them as ordinal, not absolute.
| Model | Dimensions | Max tokens | $/1M tokens | MTEB avg | Multilingual | Self-host |
|---|---|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 (Matryoshka) | 8191 | $0.13 | 64.6 | Good | No |
| OpenAI text-embedding-3-small | 1536 (Matryoshka) | 8191 | $0.02 | 62.3 | Good | No |
| Cohere embed-v4 | 1024 / 1536 (Matryoshka) | 8192 | $0.12 | 65.1 | Best in class (100+) | No |
| Voyage-3-large | 1024 (Matryoshka) | 32000 | $0.18 | 67.4 | Strong | No |
| Voyage-3 | 1024 | 32000 | $0.06 | 64.0 | Strong | No |
| BGE-M3 | 1024 (dense + sparse + multi-vec) | 8192 | $0 (self-host) | 63.5 | Best open (100+) | Yes (MIT) |
| BGE-large-en-v1.5 | 1024 | 512 | $0 (self-host) | 64.2 (English) | English only | Yes (MIT) |
| mxbai-embed-large-v1 | 1024 (Matryoshka) | 512 | $0 (self-host) | 64.7 (English) | English-leaning | Yes (Apache 2.0) |
Two honest caveats before anyone screenshots this. First, MTEB averages collapse 56 tasks into one number - your specific domain may invert the ordering, especially for code, legal, or biomedical retrieval. Second, prices are list rates as of spring 2026; volume discounts on Voyage and Cohere can cut the per-token cost by 30 to 50 percent at scale.
The 4 things you actually evaluate
Practitioners get lost in MTEB because it is a single number and looks like a leaderboard. The decision is multi-axis. These are the four axes I score every candidate on before a model gets near production.
- Recall at 10 on your data. The only number that matters. Build a 100 to 500 query eval set from real user questions and labeled correct chunks, then measure recall@10 for each candidate model on your actual corpus. MTEB is a proxy; this is the ground truth.
- Latency at p95. Embedding the query is almost always on the critical path. OpenAI and Cohere land around 80 to 150 ms from US East. Voyage is 120 to 220 ms. Self-hosted BGE on a co-located GPU is 15 to 40 ms. For chat-style interactive RAG, the latency difference is noticeable.
- Cost at your token volume. Embedding costs come in two flavors - ingestion (one-time, big batches) and query (per-request, smaller). Most teams underestimate ingestion by 5 to 10x because they forget about reindexing on model upgrades and chunking changes.
- Dimension count. Every dimension is bytes in your vector database, RAM in your index, and time on every distance calculation. 3072 dimensions is 4x the storage and roughly 2x the query time of 768. Matryoshka gives you a way to truncate without re-embedding, which is the single biggest cost lever after model choice.
The temptation is to pick on MTEB and ship. Resist it. I have watched teams switch from OpenAI to a higher-ranked open model and lose recall on their actual queries because the MTEB tasks did not match their domain. Always evaluate on your data.
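The recall@10 measurement from the first axis fits in a few lines. This is a minimal sketch, assuming you have already run each candidate model's retrieval and saved the ranked chunk ids per query. The `EvalExample` shape and the `recallAtK` name are illustrative, not from any library.

```typescript
// One labeled eval example. All field names here are hypothetical.
type EvalExample = {
  query: string;
  relevantChunkIds: string[];  // chunks a human labeled as correct
  retrievedChunkIds: string[]; // ranked best-first, from the model under test
};

// Mean recall@k: for each query, the fraction of labeled-relevant chunks
// found in the top k results, averaged over all labeled queries.
function recallAtK(examples: EvalExample[], k = 10): number {
  let sum = 0;
  let counted = 0;
  for (const ex of examples) {
    if (ex.relevantChunkIds.length === 0) continue; // nothing to recall
    const topK = new Set(ex.retrievedChunkIds.slice(0, k));
    const hits = ex.relevantChunkIds.filter((id) => topK.has(id)).length;
    sum += hits / ex.relevantChunkIds.length;
    counted += 1;
  }
  return counted === 0 ? 0 : sum / counted;
}
```

Run it once per candidate model over the same labeled queries and compare the resulting numbers side by side; that comparison, not the leaderboard, is what should drive the pick.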
Real benchmarks on 3 retrieval tasks
I ran a controlled comparison on three representative client workloads - a 50,000 chunk SaaS FAQ, a 180,000 chunk TypeScript and Python code search index, and a 400,000 chunk multilingual product knowledge base covering English, German, French, and Albanian. Same chunks, same 300 labeled queries per task, top-k=10, no rerank, cosine distance.
| Model | FAQ recall@10 | Code recall@10 | Multilingual recall@10 | Avg query latency | Notes |
|---|---|---|---|---|---|
| OpenAI 3-small | 0.84 | 0.71 | 0.78 | 95 ms | Baseline default |
| OpenAI 3-large | 0.87 | 0.74 | 0.81 | 140 ms | Marginal lift vs price |
| Cohere embed-v4 | 0.86 | 0.73 | 0.89 | 110 ms | Multilingual standout |
| Voyage-3-large | 0.91 | 0.83 | 0.85 | 180 ms | Best retrieval, slowest |
| Voyage-3 | 0.86 | 0.76 | 0.81 | 160 ms | Strong mid-tier |
| BGE-M3 (self-host) | 0.85 | 0.72 | 0.86 | 28 ms | Cheapest at scale |
Read the table by column, not by row. Voyage-3-large wins outright on code search because the training mix is heavy on code. Cohere wins on multilingual because it was tuned explicitly for that task. BGE-M3 punches well above its price tag and the latency story is dramatic when you co-locate inference with the vector DB. OpenAI 3-small is never the best model, and it is never bad enough to disqualify itself.
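For readers who want to reproduce the scoring setup (top-k=10, no rerank, cosine), here is a brute-force sketch. It is an illustration only: a real run would query a vector database index rather than scan every chunk, and the `Chunk` type and function names are assumptions made for this example.

```typescript
// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vectors have no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

type Chunk = { id: string; vector: number[] }; // hypothetical row shape

// Exhaustive top-k: score every chunk against the query, return ids best-first.
function topK(query: number[], chunks: Chunk[], k = 10): string[] {
  return chunks
    .map((chunk) => ({ id: chunk.id, score: cosineSimilarity(query, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.id);
}
```

The ranked ids this returns are what gets compared against the labeled chunks to produce a recall@10 figure per model.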
Cost math at 100K, 1M, and 10M vectors
Cost arguments get hand-wavy fast. Here is the concrete ingestion math, assuming an average chunk of 400 tokens and a one-time embed of the whole corpus. Reindexing on a model change costs the same again. Query costs are roughly 20 to 80 tokens per request, so for any QPS under a few hundred, query spend is rounding error against ingestion.
| Model | 100K chunks | 1M chunks | 10M chunks |
|---|---|---|---|
| OpenAI 3-small ($0.02/M) | $0.80 | $8.00 | $80 |
| OpenAI 3-large ($0.13/M) | $5.20 | $52 | $520 |
| Cohere embed-v4 ($0.12/M) | $4.80 | $48 | $480 |
| Voyage-3-large ($0.18/M) | $7.20 | $72 | $720 |
| BGE-M3 (GPU rental) | ~$5 (one-off VM) | ~$15 (few hours) | ~$80 (full day) |
The crossover that surprises people: BGE-M3 self-hosted is cheaper than OpenAI 3-small from roughly 10 million chunks upward, and dramatically cheaper than Voyage from 1 million chunks. The catch is that self-hosting buys you a running inference server, not a free lunch. Below 5 million chunks and without an SRE on the team, API models almost always win on total cost once you price your engineering hours honestly. This is the same shape as the pgvector vs Pinecone crossover and the OpenAI API cost breakdown - the line moves with how you value ops time.
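The API rows in the table are simple arithmetic: chunks times tokens per chunk, divided by one million, times the list price. A small sketch, assuming the 400-token average chunk used above; the model keys are labels chosen for this example, not official model identifiers.

```typescript
// List prices in USD per 1M tokens, copied from the table above.
const LIST_PRICE_PER_MILLION: Record<string, number> = {
  "openai-3-small": 0.02,
  "openai-3-large": 0.13,
  "cohere-embed-v4": 0.12,
  "voyage-3-large": 0.18,
};

// One-time ingestion cost for embedding a whole corpus through an API.
function ingestionCostUsd(
  chunks: number,
  pricePerMillionTokens: number,
  avgTokensPerChunk = 400, // assumption from the article; measure your own corpus
): number {
  const totalTokens = chunks * avgTokensPerChunk;
  return (totalTokens / 1_000_000) * pricePerMillionTokens;
}

// Print the 1M-chunk column for every API model.
for (const model of Object.keys(LIST_PRICE_PER_MILLION)) {
  const cost = ingestionCostUsd(1_000_000, LIST_PRICE_PER_MILLION[model]);
  console.log(`${model}: $${cost.toFixed(2)}`);
}
```

Multiply the result by the number of full reindexes you expect per year to get a more honest ingestion budget.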
When OpenAI wins
OpenAI 3-small is the sane default. It is the embedding model I reach for at the start of every new client engagement and the one I leave in place unless an explicit signal pushes me elsewhere. The reasons are boring and that is the point.
It is cheap enough that ingestion cost almost never shows up on the bill. The dimensions support Matryoshka truncation down to 256 or 512 for a major storage win. The SDK is identical to every other OpenAI call you are already making, so there is no new auth, no new SDK, no new dashboard. The broad-domain quality is consistently in the band where it is not the best model but it is never the bottleneck. And when you do need more, you can swap to 3-large with one line of code and keep the same client.
Use OpenAI when your corpus is general-purpose web content, SaaS documentation, customer support transcripts, or anything else that looks like the public internet. Skip it when you have a specialized domain where Voyage measurably wins, or when you need multilingual quality strong enough to handle cross-lingual retrieval as a first-class requirement.
When Cohere wins
Cohere embed-v4 is the model I reach for when multilingual quality is the requirement and Voyage does not cover the languages I need. The training mix is genuinely multilingual, not English with a multilingual head bolted on, and cross-lingual retrieval works the way you want - a Spanish query lands on the relevant German chunk if the meaning matches.
The second Cohere superpower is instruction-following embeddings. You can prefix a query with a task instruction like "Represent this query for retrieving technical documentation" and the embedding shifts to optimize for that retrieval style. This sounds gimmicky and is not - on the eval sets I run for clients with mixed query styles (questions, keywords, full sentences), instruction prefixes lift recall by 3 to 6 percent.
The third reason to pick Cohere is that you are probably going to pair it with Cohere rerank-v3, and the embedding + reranker combination from one vendor is genuinely smoother than mixing providers. The rerank story is where Cohere consistently wins on the final retrieval quality, and using their embedding model alongside keeps both calls on one SDK, one API key, and one invoice.
When Voyage wins
Voyage-3-large is the highest-quality embedding API on the market right now and the gap is real, not marketing. On the client work where retrieval quality is the bottleneck - legal contracts, financial filings, biomedical literature, production code search - Voyage consistently lifts recall by 8 to 14 percent over OpenAI 3-small and 4 to 8 percent over OpenAI 3-large on my eval sets.
Voyage also ships specialized models for the verticals where that lift matters most: voyage-law-2 for legal, voyage-finance-2 for financial filings, voyage-code-3 for source code. The specialized models are not always strictly better than voyage-3-large - sometimes the general flagship wins on recent benchmarks - but the option to switch within one API without changing your pipeline is genuinely useful.
Pick Voyage when retrieval quality is the user-facing constraint, your domain is specialized enough that general models leave recall on the table, and you have budget for the higher per-token cost. Skip it when your corpus is general-purpose and you can recover the same quality lift by adding a reranker on top of a cheaper embedding model - which is often the better economic call.
When BGE and self-hosting win
BGE-M3 is the open-source model that genuinely competes with the closed-source APIs. The MTEB score sits between OpenAI 3-small and 3-large, the multilingual story is excellent (100+ languages with cross-lingual retrieval that works), and the model supports dense, sparse, and multi-vector representations in one forward pass. For hybrid retrieval setups, that last property is a real advantage.
Self-hosting wins on three axes. Data residency is the most common driver - EU clients with sensitive content cannot send every chunk to a US-hosted API regardless of the SOC 2 attestation. Cost is the second - at volumes above roughly 50 million chunks per month (about 20 billion tokens at 400 tokens per chunk, or $400 at OpenAI 3-small list price), a single A10 or L4 GPU running text-embeddings-inference from Hugging Face is cheaper than any API. Predictability is the third - API rate limits, latency variance, and version changes are real operational risks that self-hosting eliminates.
The cost is operational. You run an inference server, you monitor GPU utilization, you handle version upgrades, you wake up if the pod restarts. Most teams I work with through my AI development practice in Kosovo find this trade worth it past the 5 million chunk mark; smaller teams in the US tend to stay on APIs longer because the engineering hours are more expensive.
Matryoshka embeddings: the 50% storage win
Matryoshka representation learning is the most under-discussed embedding development of the last two years. The training objective forces the first N dimensions of an embedding to be independently usable, so a 1536-dimension vector can be truncated to 768 or 512 or 256 and still produce sensible retrieval. The lift on storage and query speed is enormous, the recall cost is small, and it requires no re-embedding.
The models that ship Matryoshka by default in 2026: OpenAI 3-small, OpenAI 3-large, Cohere embed-v4, Voyage-3-large, Nomic Embed v1.5, and mxbai-embed-large-v1. Most of the other recent open models are training toward it. The usable truncation depth depends on the model - OpenAI 3-large is usable down to 256 with only a 2 to 4 percent recall drop. Voyage-3-large and Cohere embed-v4 are usable down to 512 with similar loss.
The pattern I use in production: store full-dimension vectors in cheap object storage, but index a truncated version in the vector database. Query against the truncated index for fast first-pass retrieval, then optionally re-score the top candidates with the full-dimension vectors for higher precision. This cuts memory cost by 4 to 6x without sacrificing recall at top-k.
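The truncate-then-rescore pattern can be sketched as follows. This is an in-memory illustration under stated assumptions: in production the truncated vectors would already sit in the vector database and the full vectors would be fetched from object storage, and every name here (`truncateEmbedding`, `twoStageSearch`, `Doc`) is hypothetical. One detail worth making explicit: a truncated prefix of a unit vector is generally no longer unit length, so the sketch re-normalizes after truncating.

```typescript
// Scale a vector to unit length so dot product equals cosine similarity.
function normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return norm === 0 ? v.slice() : v.map((x) => x / norm);
}

// Keep the first `dims` dimensions of a Matryoshka embedding, then re-normalize.
function truncateEmbedding(full: number[], dims: number): number[] {
  if (!Number.isInteger(dims) || dims < 1 || dims > full.length) {
    throw new Error(`dims must be an integer between 1 and ${full.length}`);
  }
  return normalize(full.slice(0, dims));
}

function dot(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

type Doc = { id: string; full: number[] }; // full-dimension vector per chunk

function twoStageSearch(
  queryFull: number[],
  docs: Doc[],
  dims: number,          // truncated size used for the first pass
  shortlistSize: number, // how many candidates survive the first pass
  k: number,             // final result count
): string[] {
  const queryShort = truncateEmbedding(queryFull, dims);
  const queryUnit = normalize(queryFull);

  // Pass 1: cheap scoring on truncated vectors.
  const shortlist = docs
    .map((doc) => ({ doc, score: dot(queryShort, truncateEmbedding(doc.full, dims)) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, shortlistSize);

  // Pass 2: re-score only the shortlist with the full-dimension vectors.
  return shortlist
    .map(({ doc }) => ({ id: doc.id, score: dot(queryUnit, normalize(doc.full)) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((scored) => scored.id);
}
```

The shortlist size is the knob: a larger shortlist recovers more of the full-dimension recall at the price of more full-vector fetches per query.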
Switching embedding models: the reindex burden
Switching embedding models is the most expensive maintenance operation in a RAG pipeline. Every chunk must be re-embedded with the new model. Every vector index must be rebuilt. The cost is real - a 10 million chunk corpus on OpenAI 3-small is $80 of API spend plus several hours of ingestion plus index rebuild time plus the dual-running cost while you validate the new model.
The architectural pattern that makes this tolerable is the embedding contract. Wrap every embedding call behind a thin interface, store the model name and dimension count with every vector, and version your indexes by model. Sample TypeScript:
```typescript
// lib/embedder.ts
export interface Embedder {
  modelId: string; // "openai-3-small", "voyage-3-large", etc.
  dimensions: number;
  embed(texts: string[]): Promise<number[][]>;
  embedQuery(text: string, task?: string): Promise<number[]>;
}

// Persisted vector row shape
export type StoredVector = {
  id: string;
  chunkId: string;
  modelId: string; // <-- the load-bearing field
  dimensions: number;
  vector: number[];
  metadata: Record<string, string | number | boolean>;
};
```

The migration pattern is then straightforward. Run the new embedder in parallel for a window - old vectors keep serving traffic, new ingestion writes both old and new model vectors, queries hit both indexes, and you measure the new model on your eval set. When the new model wins, you flip the read path, run a backfill for historical chunks, and after a validation window you delete the old vectors. The whole operation takes a week of calendar time for a 10 million chunk corpus and almost no engineering time. Skip this contract and you will spend a weekend rewriting your retrieval layer instead. The same discipline applies to more complex setups like the ones in the agentic RAG architecture guide.
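The dual-write step of that migration might look like the sketch below. It is a minimal illustration, not a definitive implementation: `VersionedIndex` is a hypothetical in-memory stand-in for "one index per model" in a real vector database, and the `Embedder` here is a trimmed copy of the contract so the block stands alone.

```typescript
// Trimmed copy of the embedding contract so this block is self-contained.
interface Embedder {
  modelId: string;
  dimensions: number;
  embed(texts: string[]): Promise<number[][]>;
}

type Chunk = { chunkId: string; text: string };
type VectorRow = { chunkId: string; modelId: string; dimensions: number; vector: number[] };

// Hypothetical in-memory index, keyed by model so old and new vectors never mix.
class VersionedIndex {
  private rowsByModel = new Map<string, Map<string, VectorRow>>();

  upsert(row: VectorRow): void {
    const rows = this.rowsByModel.get(row.modelId) ?? new Map<string, VectorRow>();
    rows.set(row.chunkId, row); // same chunk + same model overwrites
    this.rowsByModel.set(row.modelId, rows);
  }

  count(modelId: string): number {
    return this.rowsByModel.get(modelId)?.size ?? 0;
  }

  // Called after the validation window to delete the old model's vectors.
  drop(modelId: string): void {
    this.rowsByModel.delete(modelId);
  }
}

// During the migration window, every ingested chunk is embedded by both models.
async function dualWrite(chunks: Chunk[], embedders: Embedder[], index: VersionedIndex): Promise<void> {
  const texts = chunks.map((chunk) => chunk.text);
  for (const embedder of embedders) {
    const vectors = await embedder.embed(texts);
    if (vectors.length !== chunks.length) {
      throw new Error(`${embedder.modelId} returned ${vectors.length} vectors for ${chunks.length} chunks`);
    }
    vectors.forEach((vector, i) => {
      if (vector.length !== embedder.dimensions) {
        throw new Error(`${embedder.modelId} returned ${vector.length} dims, expected ${embedder.dimensions}`);
      }
      index.upsert({
        chunkId: chunks[i].chunkId,
        modelId: embedder.modelId,
        dimensions: embedder.dimensions,
        vector,
      });
    });
  }
}
```

Flipping the read path then comes down to changing which `modelId` the query side asks for, which is why storing that field on every row matters.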
My picks by use case
These are concrete recommendations I would give a friend in each scenario, with the caveats baked in. If your situation is exotic, bring it to a scoping call - none of these are absolute.
MVP RAG, no specialized domain: OpenAI text-embedding-3-small with Matryoshka truncated to 512 dimensions. Pair with a small reranker only if eval shows you need it. Total embedding spend under $5 per month for a typical prototype corpus.
Production RAG, general-purpose corpus: OpenAI 3-small at full 1536 dimensions plus Cohere rerank-v3 on top 50 results. This is the highest-quality-per-dollar configuration I have shipped, and it is what runs in two of my client deployments today.
Specialized domain (legal, finance, code, biomed): Voyage-3-large or the matching vertical model from Voyage, with no reranker initially. The base embedding quality is high enough that a reranker is often unnecessary, which saves a network hop and a per-query cost.
Heavy multilingual (5+ languages, cross-lingual): Cohere embed-v4 with instruction prefixes, plus Cohere rerank-v3 multilingual. The same-vendor pairing is smoother than mixing providers, and the multilingual retrieval quality is consistently the strongest in the API category.
EU data residency requirement: BGE-M3 self-hosted on a co-located GPU, with text-embeddings-inference as the server. Pair with a self-hosted reranker like bge-reranker-v2-m3 to keep the entire retrieval pipeline inside your perimeter. This is the only configuration I trust for strictly regulated EU clients.
Cost-sensitive at scale (50M+ chunks): BGE-M3 self-hosted is the right answer. The API per-token cost stops being marginal at this scale, and a single GPU node handles surprising throughput. If you are also building custom retrieval and tool pipelines, this is the kind of work I cover under AI integration.
OmniAPI's actual stack: OmniAPI currently runs OpenAI 3-small at 1024 dimensions (Matryoshka truncated from 1536) paired with Voyage rerank-2 on top 50. The corpus is roughly 4.2 million chunks of API documentation, the monthly embedding bill is under $40, and recall@5 on the production eval set is 0.93. The migration interface above is in place; a Voyage-3-large variant has been benchmarked and is ready to ship the day the recall ceiling moves.
Frequently asked questions
These are the questions I get most often when teams scope embedding work with me. The answers are also embedded as FAQ structured data for search.
What is the best embedding model for RAG in 2026?
There is no single best model. Voyage-3-large is currently the strongest on raw retrieval quality across MTEB and the domain-specific benchmarks I run, but it costs more and locks you into a closed API. OpenAI text-embedding-3-small is the sane default for general-purpose RAG because it is cheap, fast, multilingual enough, and easy to swap later. BGE-M3 is the right pick when you must self-host for cost, residency, or scale reasons.
Does the embedding model really matter compared to chunking and reranking?
It matters less than people think and more than the discourse suggests. On the eval sets I run, switching from OpenAI 3-small to Voyage-3-large lifts recall@10 by 8 to 14 percent on domain-heavy corpora and 2 to 4 percent on generic content. Reranking with Cohere rerank-v3 or Voyage rerank-2 typically lifts another 5 to 10 percent on top of any base model. Chunking strategy dwarfs both for badly chunked corpora. Fix chunking first, pick a sensible model second, add a reranker third.
Is OpenAI text-embedding-3-large worth the 6.5x price over 3-small?
Rarely. The MTEB gap between 3-small and 3-large is about 2.3 points (62.3 vs 64.6) and the real-world retrieval gap on most of my client corpora is under 4 percent at top-10. For that lift you pay 6.5x per token and store vectors that are twice as large. If you have budget to spend on quality, that same money buys more lift from a reranker or from upgrading to Voyage-3-large.
Can I really self-host BGE-M3 in production?
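Yes, and a surprising number of teams should. BGE-M3 on a single A10 or L4 GPU handles roughly 800 to 1200 embeddings per second for short documents at a flat infrastructure cost of $300 to $500 per month. Against OpenAI 3-small at $0.02 per million tokens, that breaks even at roughly 15 to 25 billion tokens per month (about 50 million 400-token chunks); against Voyage-3-large at $0.18, at roughly 1.7 to 2.8 billion. Past those volumes it is cheaper than the API. The hidden cost is operating the inference server itself.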
What are Matryoshka embeddings and why should I care?
Matryoshka representation learning trains a single embedding so that its first N dimensions are still usable on their own. OpenAI 3-small and 3-large support it, and so do Nomic and several open models. In practice you keep the full 1536 or 3072 dimensions in cheap storage and index a copy truncated to 256 or 512, which cuts index storage and memory by roughly 70 to 90 percent with only a few percent recall loss.
How painful is switching embedding models on a live system?
It is the most expensive maintenance event in a RAG pipeline because every chunk has to be re-embedded and re-indexed. On a 10 million chunk corpus that is 2 to 8 hours of compute, roughly $80 to $720 in API spend at the list prices above depending on the model, and a non-trivial cutover. The way to make it cheap is to write your application against a thin embedding interface from day one and store the model name with each vector.
Do multilingual embeddings really work or should I use one model per language?
Modern multilingual models are good enough that per-language models are rarely worth it. Cohere embed-v4 and BGE-M3 both handle 100 plus languages with cross-lingual retrieval that is usable out of the box. Voyage-3 multilingual is the strongest on the benchmarks I have run for low-resource European languages, including Albanian. Per-language models still win in extreme cases like Chinese-only corpora at scale, but for most multilingual RAG, one model is the right architecture.
Where do reranker models fit relative to embeddings?
Embeddings give you a fast first-pass retrieval over millions of vectors. Rerankers are slower cross-encoders that score the top 50 to 100 candidates with much higher precision. The standard production pattern in 2026 is to embed with a cheap model, retrieve top 50, then rerank with Cohere rerank-v3 or Voyage rerank-2 down to top 5. This is almost always cheaper and better than upgrading to a more expensive embedding model alone.
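The retrieve-50-then-rerank-to-5 pattern can be sketched with two injected functions. The `FirstPassSearch` and `Reranker` shapes are assumptions made for this example, not the Cohere or Voyage SDK signatures; adapt them to whatever client you actually call.

```typescript
type Candidate = { chunkId: string; text: string };

// Hypothetical function shapes: wire in your own vector search and reranker client.
type FirstPassSearch = (query: string, k: number) => Promise<Candidate[]>;
// Returns one relevance score per text, in the same order; higher is better.
type Reranker = (query: string, texts: string[]) => Promise<number[]>;

async function retrieveThenRerank(
  query: string,
  search: FirstPassSearch,
  rerank: Reranker,
  retrieveK = 50, // wide, cheap first pass over the vector index
  finalK = 5,     // narrow, precise result after the cross-encoder
): Promise<Candidate[]> {
  const candidates = await search(query, retrieveK);
  if (candidates.length === 0) return [];

  const scores = await rerank(query, candidates.map((c) => c.text));
  if (scores.length !== candidates.length) {
    throw new Error("reranker must return one score per candidate");
  }

  return candidates
    .map((candidate, i) => ({ candidate, score: scores[i] }))
    .sort((a, b) => b.score - a.score)
    .slice(0, finalK)
    .map((scored) => scored.candidate);
}
```

Keeping both steps behind plain function types makes it cheap to A/B a reranker against a more expensive embedding model on the same eval set.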
Closing
The embedding model category in 2026 is mature enough that any of the eight models in the opening table will get you to production. The difference between picking well and picking badly is a few percentage points of recall on your real queries, a 5 to 10x range in ingestion cost, and how much operational surface you take on. Start with OpenAI 3-small unless your domain is specialized or your scale pushes you to self-host. Build the migration contract from day one. Always evaluate on your data, never on MTEB. Re-evaluate at 1 million chunks and again at 10 million. That is the whole playbook.