Embeddings in Practice: Every Major Model Compared
Every major embedding model compared: dimensions, MTEB scores, pricing, latency, and concrete guidance on which to use for retrieval, classification, and clustering in 2026.
Text embeddings convert variable-length strings into fixed-length numerical vectors that capture semantic meaning. They are the backbone of retrieval-augmented generation, semantic search, classification, clustering, anomaly detection, and recommendation systems. The choice of embedding model affects retrieval quality, latency, cost, and storage requirements in ways that are often underappreciated until something breaks in production.
The embedding landscape in early 2026 looks different from even a year ago. OpenAI has iterated past text-embedding-ada-002. Cohere has shipped Embed v4. Voyage AI (acquired by MongoDB in early 2025) has tightened its integration with the Claude ecosystem. Open-source models from BAAI (BGE) and Jina AI have closed the gap with proprietary options on MTEB benchmarks. Picking the right model now requires navigating a matrix of tradeoffs that didn’t exist when everyone just defaulted to Ada.
Table of Contents
- What Embeddings Actually Represent
- How Embedding Models Are Trained
- The MTEB Benchmark: What It Measures and What It Misses
- Model-by-Model Comparison
- Dimensions, Storage, and the Matryoshka Trick
- Latency and Throughput Benchmarks
- Pricing Comparison
- Task-Specific Recommendations
- Practical Integration Patterns
- Common Failure Modes
- Summary
- Further Reading
What Embeddings Actually Represent {#what-embeddings-actually-represent}
An embedding model maps text to a point in high-dimensional space where geometric distance corresponds to semantic similarity. Two sentences about the same concept end up near each other; unrelated sentences end up far apart. The specific geometry depends on the training objective, but the practical result is that cosine similarity between embedding vectors serves as a proxy for semantic relatedness.
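As a concrete (toy) illustration, here is how cosine similarity ranks related text above unrelated text. The vectors below are made up for the example, not real model outputs:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; 1.0 = identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-d "embeddings": two sentences about cats, one about finance
cat_a = np.array([0.9, 0.1, 0.0, 0.2])
cat_b = np.array([0.8, 0.2, 0.1, 0.3])
finance = np.array([0.0, 0.1, 0.9, 0.1])

related = cosine_similarity(cat_a, cat_b)      # high: same topic
unrelated = cosine_similarity(cat_a, finance)  # low: different topics
```

Real embedding vectors have hundreds or thousands of dimensions, but the geometry works the same way.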
The embedding pipeline: text is tokenized, processed through a transformer encoder, then pooled into a single fixed-length vector.
The pooling layer matters more than most people realize. Early models used [CLS] token pooling (taking the output of the special classification token). Most modern embedding models use mean pooling over all token representations, which tends to produce better results for retrieval tasks. Some models (notably Jina’s) use more sophisticated attention-weighted pooling.
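A minimal sketch of masked mean pooling, assuming the encoder has already produced per-token states (the shapes here are illustrative):

```python
import numpy as np

def mean_pool(token_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Mean-pool token representations, ignoring padding positions.

    token_states: (seq_len, hidden_dim); attention_mask: (seq_len,) of 0/1.
    """
    mask = attention_mask[:, None].astype(token_states.dtype)  # (seq_len, 1)
    summed = (token_states * mask).sum(axis=0)
    count = mask.sum()
    return summed / count  # one fixed-length vector for the whole input

# 5 token positions, hidden size 8; the last 2 positions are padding
rng = np.random.default_rng(0)
states = rng.normal(size=(5, 8))
mask = np.array([1, 1, 1, 0, 0])

pooled = mean_pool(states, mask)  # shape (8,), independent of input length
```

The key point: padding tokens must be excluded from the average, or short inputs in a padded batch get diluted embeddings.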
The output vector’s dimensionality is a design choice. Higher dimensions can capture more nuanced distinctions but cost more to store and compare. OpenAI’s text-embedding-3-large outputs 3072 dimensions. Cohere’s Embed v4 defaults to 1024. The relationship between dimension count and quality is sublinear — going from 256 to 1024 dimensions helps a lot; going from 1024 to 3072 helps less.
How Embedding Models Are Trained {#how-embedding-models-are-trained}
Understanding training helps explain why certain models work better for certain tasks.
Most modern embedding models use a two-stage training process:
Stage 1: Contrastive pre-training. The model learns from large-scale weakly supervised pairs — titles and bodies of web pages, questions and answers from forums, query-document pairs from search logs. The training objective pulls matching pairs together and pushes non-matching pairs apart in the embedding space. This is typically done with InfoNCE loss or a variant.
Stage 2: Fine-tuning with hard negatives. The model is refined on curated datasets with carefully selected hard negatives — documents that are superficially similar but not actually relevant. This stage teaches the model to make finer-grained distinctions.
Two-stage training with optional instruction tuning. Models like E5-Mistral and BGE-en-icl add instruction-following capabilities in a third stage.
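The stage-1 contrastive objective (InfoNCE with in-batch negatives) can be sketched in a few lines. This is a numpy illustration with synthetic vectors, not a training-ready implementation:

```python
import numpy as np

def info_nce_loss(query_emb: np.ndarray, doc_emb: np.ndarray,
                  temperature: float = 0.05) -> float:
    """InfoNCE over a batch: doc i is the positive for query i;
    all other docs in the batch serve as in-batch negatives."""
    # Normalize so dot products are cosine similarities
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    logits = q @ d.T / temperature              # (batch, batch)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    # Cross-entropy with the diagonal (the matching pair) as the target
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(1)
queries = rng.normal(size=(4, 32))
aligned_docs = queries + 0.01 * rng.normal(size=(4, 32))  # good positives
random_docs = rng.normal(size=(4, 32))                     # unrelated docs

low = info_nce_loss(queries, aligned_docs)   # near zero: pairs match
high = info_nce_loss(queries, random_docs)   # large: nothing to pull together
```

Gradient descent on this loss is what pulls matching pairs together and pushes mismatches apart.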
Some newer models add a third stage: instruction tuning for embeddings. Models like E5-Mistral-7B-instruct and BGE’s bge-en-icl accept a task-specific instruction prefix (e.g., “Represent this document for retrieval” vs. “Represent this sentence for classification”). This lets a single model optimize its embedding space for different downstream tasks. In practice, using the right instruction prefix can improve retrieval nDCG@10 by 2-5 points on MTEB tasks.
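At inference time the instruction is just string formatting. A sketch of the E5-Mistral-style template (the exact wording is model-specific; always check the model card):

```python
def with_instruction(task: str, text: str) -> str:
    """Build an instruction-prefixed input for an E5-Mistral-style embedder.
    In that family, only queries get the prefix; documents are embedded bare."""
    return f"Instruct: {task}\nQuery: {text}"

query_input = with_instruction(
    "Given a web search query, retrieve relevant passages that answer the query",
    "how do vector databases handle updates",
)
```

Swapping the task description ("classify the sentiment of…" vs. "retrieve relevant passages…") is what shifts the embedding space toward the downstream task.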
The MTEB Benchmark: What It Measures and What It Misses {#the-mteb-benchmark}
The Massive Text Embedding Benchmark (MTEB) is the standard leaderboard for embedding models. It covers 8 task categories across 58+ datasets (as of the latest MTEB v2 revision):
| Task Category | Example Datasets | Metric |
|---|---|---|
| Retrieval | MS MARCO, NQ, BEIR suite | nDCG@10 |
| Semantic Textual Similarity | STS Benchmark, SICK-R | Spearman correlation |
| Classification | Amazon Reviews, Tweet Sentiment | Accuracy |
| Clustering | Reddit, ArXiv | V-measure |
| Pair Classification | Twitter URL Paraphrase | AP |
| Reranking | AskUbuntu, StackOverflow | MAP |
| Summarization | SummEval | Spearman correlation |
| Bitext Mining | Tatoeba | F1 |
What MTEB gets right: it evaluates across diverse tasks, so a model can’t game a single metric. The retrieval subset (based on BEIR) is particularly well-designed, covering domain-specific corpora like scientific papers, financial documents, and biomedical text.
What MTEB misses:
- Long-document performance. Most MTEB retrieval datasets have short passages (100-300 tokens). Models that excel on MTEB may degrade on 2000+ token chunks. MTEB has added some long-document tasks, but coverage is thin.
- Cross-lingual nuance. The multilingual track exists but is less comprehensive than the English track. Performance on low-resource languages is underrepresented.
- Domain-specific retrieval. BEIR covers some domains, but production use cases in legal, medical, or codebase search have distributional properties that MTEB doesn’t fully capture.
- Embedding + reranker pipeline quality. MTEB evaluates embeddings in isolation. In production, a mediocre embedding model paired with a strong reranker (Cohere Rerank, cross-encoder) can outperform a better embedding model used alone.
Use MTEB scores as a starting signal, not as a final answer. Always evaluate on a representative sample of production data.
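For that in-house evaluation, nDCG@10 is simple enough to compute directly. A minimal graded-relevance implementation, where `relevant` maps doc IDs to relevance grades for one query:

```python
import math

def ndcg_at_10(ranked_doc_ids: list[str], relevant: dict[str, int]) -> float:
    """nDCG@10: discounted cumulative gain of the top 10 results,
    normalized by the gain of the ideal ordering."""
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    gains = [relevant.get(doc_id, 0) for doc_id in ranked_doc_ids[:10]]
    ideal = sorted(relevant.values(), reverse=True)[:10]
    return dcg(gains) / dcg(ideal) if relevant else 0.0

# One query: d1 is highly relevant (grade 3), d3 somewhat (grade 1)
score = ndcg_at_10(["d1", "d9", "d3", "d4"], {"d1": 3, "d3": 1})
perfect = ndcg_at_10(["d1", "d3"], {"d1": 3, "d3": 1})  # ideal ordering -> 1.0
```

Average this over your 100-500 evaluation queries to compare candidate models on your own corpus.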
Model-by-Model Comparison {#model-by-model-comparison}
The Full Landscape
| Model | Provider | Max Tokens | Default Dims | MTEB Avg (English) | Matryoshka | Multilingual | Release |
|---|---|---|---|---|---|---|---|
| text-embedding-3-large | OpenAI | 8,191 | 3,072 | 64.6 | Yes (256–3072) | 20+ langs | Jan 2024 |
| text-embedding-3-small | OpenAI | 8,191 | 1,536 | 62.3 | Yes (256–1536) | 20+ langs | Jan 2024 |
| embed-v4.0 | Cohere | 512 | 1,024 | 66.2* | Yes (256–1024) | 100+ langs | Q4 2025 |
| voyage-3-large | Voyage AI | 32,000 | 1,024 | 67.1* | Yes (256–1024) | 30+ langs | Q3 2025 |
| voyage-3 | Voyage AI | 32,000 | 1,024 | 65.4 | Yes | 30+ langs | Mid 2025 |
| voyage-3-lite | Voyage AI | 32,000 | 512 | 62.8 | No | 30+ langs | Mid 2025 |
| voyage-code-3 | Voyage AI | 32,000 | 1,024 | — | No | Code-focused | Q3 2025 |
| jina-embeddings-v3 | Jina AI | 8,192 | 1,024 | 65.5 | Yes (64–1024) | 30+ langs | Late 2024 |
| bge-en-icl | BAAI | 8,192 | 4,096 | 66.5 | No | English-only | Mid 2024 |
| bge-m3 | BAAI | 8,192 | 1,024 | 63.5 | No | 100+ langs | Early 2024 |
| e5-mistral-7b-instruct | Microsoft | 32,768 | 4,096 | 66.6 | No | English-primary | Late 2023 |
| nomic-embed-text-v2-moe | Nomic AI | 8,192 | 768 | 63.1 | Yes (64–768) | English-primary | Early 2025 |
| mxbai-embed-large-v1 | Mixedbread | 512 | 1,024 | 64.7 | Yes | English-primary | Early 2024 |
*Scores marked with asterisks are from provider-reported evaluations and may use slightly different MTEB configurations. Independent reproductions sometimes show 0.5-1.5 point differences.
OpenAI: text-embedding-3-large / 3-small
OpenAI’s embedding models are the most widely deployed, mostly because of ecosystem inertia. text-embedding-3-large is a solid all-rounder. Its main advantages: wide availability, simple API, and native Matryoshka support that allows truncating the 3072-dimensional output to 256 or 1024 dimensions with graceful quality degradation.
The main drawback is that these models are no longer state-of-the-art on retrieval benchmarks. Voyage-3-large and Cohere Embed v4 outperform them on most BEIR datasets, particularly for domain-specific retrieval. The 8,191 token context window is adequate for typical RAG chunks but limiting for whole-document embedding.
Cohere: Embed v4
Cohere’s Embed v4 introduced input type parameters (search_document, search_query, classification, clustering) that adjust the embedding space for the target task. This is similar in spirit to instruction-tuned embeddings but exposed as a simpler API parameter.
Multilingual coverage is Cohere’s strongest selling point — 100+ languages with a single model, and the quality on non-English retrieval tasks is notably better than OpenAI’s multilingual performance. The 512-token input limit is a real constraint, though. Anything over 512 tokens gets truncated, which means chunking strategy matters more with Cohere than with models that accept longer inputs.
Cohere also bundles a reranker (Rerank v3.5) that pairs well with Embed v4 for two-stage retrieval. The combined pipeline often outperforms using a more expensive embedding model alone.
Voyage AI: voyage-3-large
Voyage AI, now part of MongoDB, produces what is probably the best proprietary embedding model as of March 2026 for English retrieval tasks. voyage-3-large tops or ties for first on most BEIR retrieval benchmarks.
The 32,000-token context window is a differentiator. It enables embedding entire documents without chunking, which is useful for document-level similarity and clustering tasks. For RAG retrieval, longer context doesn’t always help (embedding a 10,000-token document into a single vector dilutes the representation of any individual passage), but for certain workflows — legal document comparison, patent search, academic paper similarity — it’s valuable.
voyage-code-3 is purpose-built for code search and retrieval. It handles code-natural language queries better than general-purpose models, with particularly strong performance on CodeSearchNet and custom code retrieval benchmarks.
Voyage AI’s model lineup: three tiers for general text plus a code-specific variant.
Jina AI: jina-embeddings-v3
Jina’s v3 model stands out for its flexibility. It supports Matryoshka dimensions from 64 to 1024, task-specific LoRA adapters (retrieval, classification, text-matching, code), and late interaction output (ColBERT-style multi-vector representations) from the same model.
The late interaction option is particularly interesting. Instead of compressing a document into a single vector, the model outputs one vector per token. Retrieval uses MaxSim — the maximum similarity between each query token vector and all document token vectors, summed across query tokens. This preserves more information than single-vector retrieval and is especially effective for queries with multiple distinct concepts.
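MaxSim itself is a few lines of linear algebra. A numpy sketch over synthetic token vectors (real systems compute this inside the vector store):

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """ColBERT-style late interaction: for each query token vector, take its
    best match among document token vectors, then sum over query tokens."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T                        # (q_tokens, d_tokens) cosine matrix
    return float(sim.max(axis=1).sum())  # best doc match per query token

rng = np.random.default_rng(2)
query = rng.normal(size=(4, 64))  # 4 query token vectors

# A document containing near-copies of every query token scores higher
# than a document of random token vectors
matching_doc = np.vstack([query + 0.05 * rng.normal(size=(4, 64)),
                          rng.normal(size=(20, 64))])
random_doc = rng.normal(size=(24, 64))

hit = maxsim_score(query, matching_doc)
miss = maxsim_score(query, random_doc)
```

Because each query token finds its own best match, a two-concept query ("tax implications of remote work") can match different passages of the document for each concept.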
The tradeoff: late interaction requires storing N vectors per document (where N is the token count), which increases storage by 100-500x compared to single-vector approaches. Jina positions this as an alternative to a separate reranking step.
Jina’s models are available both as a hosted API and as open-weight downloads, making them a good option for teams that need to run embeddings on-premise.
BAAI BGE: bge-en-icl and bge-m3
BGE (Beijing Academy of Artificial Intelligence General Embedding) models are the open-source heavyweight. bge-en-icl uses in-context learning — it accepts few-shot examples in the prompt to adapt its embedding behavior for specific tasks at inference time. This flexibility is powerful but adds complexity: the quality of the in-context examples directly affects output quality.
bge-m3 supports three retrieval modes from a single model: dense (single vector), sparse (learned sparse representations like SPLADE), and ColBERT-style multi-vector. This multi-modality makes it a strong choice for hybrid search pipelines that combine dense and sparse retrieval.
Both models run locally on commodity hardware. bge-m3 fits on a single consumer GPU (it’s ~567M parameters). bge-en-icl is larger (~7B parameters) and needs a GPU with at least 16GB VRAM for fp16 inference.
Microsoft E5-Mistral-7B-Instruct
Built on Mistral-7B, this was the first model to demonstrate that decoder-only LLMs could produce competitive embeddings (previously, encoder-only architectures like BERT dominated). It accepts task-specific instructions and performs near the top of MTEB, particularly on retrieval tasks.
The main drawbacks: 7B parameters means it’s expensive to self-host (needs a capable GPU), and inference latency is higher than encoder-based models. At ~150ms per embedding on an A100, it’s 5-10x slower than bge-m3 or Jina v3. For batch offline embedding, this is fine. For real-time query embedding in a search pipeline, it may be too slow without optimization.
Nomic: nomic-embed-text-v2-moe
Nomic’s v2 uses a mixture-of-experts architecture to keep the active parameter count low while maintaining quality. At 768 dimensions, it’s compact. Matryoshka support down to 64 dimensions makes it viable for extremely storage-constrained applications. The model is fully open-source (Apache 2.0) and runs efficiently on CPUs.
The quality is a step below the leaders on MTEB retrieval but competitive on classification and clustering tasks. For teams that need an open, lightweight model and don’t require top-tier retrieval accuracy, it’s a reasonable choice.
Dimensions, Storage, and the Matryoshka Trick {#dimensions-storage-and-matryoshka}
Matryoshka Representation Learning (MRL) trains embedding models so that the first d dimensions of the full vector form a valid (if lower-quality) embedding on their own. Named after Russian nesting dolls, this technique lets a single model serve multiple quality/cost tradeoffs.
Matryoshka embeddings: truncate to any prefix length. Quality degrades gracefully — 256-d retains ~95% of full retrieval quality for many tasks.
The storage math matters at scale:
| Dimensions | Bytes per Vector (float32) | Bytes per Vector (int8 quantized) | 10M Vectors (float32) | 10M Vectors (int8) |
|---|---|---|---|---|
| 3,072 | 12,288 | 3,072 | ~115 GB | ~29 GB |
| 1,024 | 4,096 | 1,024 | ~38 GB | ~9.5 GB |
| 512 | 2,048 | 512 | ~19 GB | ~4.8 GB |
| 256 | 1,024 | 256 | ~9.5 GB | ~2.4 GB |
Going from 3072 to 256 dimensions reduces storage by 12x. Combined with int8 quantization, the reduction is 48x. For a 100M-document corpus, this is the difference between needing a cluster and fitting on a single machine.
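The table's figures come straight from `num_vectors × dims × bytes_per_dim`. A quick helper for sizing your own index:

```python
def index_size_gb(num_vectors: int, dims: int, bytes_per_dim: int = 4) -> float:
    """Raw vector storage: float32 is 4 bytes/dim, int8 quantized is 1."""
    return num_vectors * dims * bytes_per_dim / 1024**3

full = index_size_gb(10_000_000, 3072)                     # matches the "~115 GB" row
compact = index_size_gb(10_000_000, 256, bytes_per_dim=1)  # matches "~2.4 GB"
```

Note this excludes index overhead (HNSW graph links, metadata), which typically adds another 20-50% on top of the raw vectors.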
The retrieval quality impact of dimension reduction varies by model and task:
```python
# Example: Using Matryoshka dimensions with OpenAI
import numpy as np
from openai import OpenAI

client = OpenAI()

# Full 3072 dimensions
full = client.embeddings.create(
    input="quarterly earnings report analysis",
    model="text-embedding-3-large",
).data[0].embedding  # len: 3072

# Truncated to 256 — just slice the array
truncated = np.array(full[:256], dtype=np.float32)

# Normalize after truncation (important!)
truncated /= np.linalg.norm(truncated)
```
Critical detail: after truncating Matryoshka embeddings, re-normalize the vector to unit length. Without normalization, cosine similarity computations will be incorrect.
Empirically, on MS MARCO passage retrieval:
- text-embedding-3-large at 3072-d: nDCG@10 of ~0.445
- Same model at 1024-d: ~0.438 (1.6% drop)
- Same model at 256-d: ~0.421 (5.4% drop)
A 5% retrieval quality drop in exchange for 12x storage reduction is a worthwhile tradeoff for many applications. For high-stakes retrieval (legal, medical), it probably isn’t.
Latency and Throughput Benchmarks {#latency-and-throughput-benchmarks}
Embedding latency matters in two contexts: query-time (embedding a user query for real-time search, where p99 latency affects UX) and ingestion-time (embedding a corpus of documents, where throughput in tokens/second affects pipeline cost).
API-Based Models
Measured from US-East, March 2026, single-request latency for a 256-token input:
| Model | Median Latency | p99 Latency | Batch Throughput (tokens/sec) |
|---|---|---|---|
| OpenAI text-embedding-3-small | 45ms | 120ms | ~1.2M |
| OpenAI text-embedding-3-large | 65ms | 180ms | ~800K |
| Cohere embed-v4.0 | 55ms | 150ms | ~900K |
| Voyage-3 | 70ms | 200ms | ~600K |
| Voyage-3-large | 95ms | 280ms | ~400K |
Batch throughput figures assume using each provider’s batch API with maximum parallelism. OpenAI’s batch embedding API (/v1/embeddings with arrays of inputs) is the most efficient, accepting up to 2048 inputs per request.
Self-Hosted Models
Measured on a single A100-80GB, batch size 64, 256-token inputs:
| Model | Parameters | Throughput (embeddings/sec) | Throughput (tokens/sec) |
|---|---|---|---|
| bge-m3 | 567M | ~450 | ~115K |
| jina-embeddings-v3 | ~570M | ~420 | ~107K |
| nomic-embed-text-v2-moe | ~400M active | ~550 | ~140K |
| e5-mistral-7b-instruct | 7B | ~60 | ~15K |
| bge-en-icl | ~7B | ~55 | ~14K |
The 7B parameter models are 7-8x slower than the sub-1B models. For large-scale ingestion (millions of documents), this difference translates directly into GPU-hours and cost. Running bge-m3 on an A100 costs roughly $2/hour on spot instances, which yields ~1.6M embeddings/hour at 256 tokens each. e5-mistral-7b-instruct on the same hardware yields ~216K embeddings/hour.
Throughput tiers: sub-1B encoder models dominate for self-hosted batch workloads. API models shift the bottleneck to network and rate limits.
Pricing Comparison {#pricing-comparison}
As of March 2026:
| Model | Price per 1M Tokens | Price per 1M Embeddings (256 tok avg) | Notes |
|---|---|---|---|
| OpenAI text-embedding-3-small | $0.02 | $0.005 | Cheapest proprietary option |
| OpenAI text-embedding-3-large | $0.13 | $0.033 | |
| Cohere embed-v4.0 | $0.10 | $0.025 | Free tier: 1K requests/month |
| Voyage-3-lite | $0.02 | $0.005 | |
| Voyage-3 | $0.06 | $0.015 | |
| Voyage-3-large | $0.18 | $0.046 | |
| Jina v3 (API) | $0.02 | $0.005 | Also available as open-weight |
| Self-hosted bge-m3 (A100 spot) | ~$0.001* | ~$0.0003* | *Compute cost only, excludes ops |
The self-hosting breakeven: at OpenAI’s text-embedding-3-large pricing of $0.13/1M tokens, embedding 1B tokens costs $130. Running bge-m3 on an A100 spot instance at $1.50/hour processes ~115K tokens/sec, so 1B tokens takes ~2.4 hours = ~$3.60. Self-hosting is 36x cheaper at this scale, but that ignores infrastructure overhead (provisioning, monitoring, model serving, autoscaling). For teams already running GPU infrastructure, self-hosting embeddings is almost always more economical above ~10M embeddings/month.
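The breakeven arithmetic generalizes to your own rates. A sketch using the throughput and price figures quoted above (compute cost only, no ops overhead):

```python
def api_cost_usd(tokens: float, price_per_million_usd: float) -> float:
    """Cost of embedding `tokens` tokens via a per-token API."""
    return tokens / 1e6 * price_per_million_usd

def self_host_cost_usd(tokens: float, tokens_per_sec: float,
                       gpu_usd_per_hour: float) -> float:
    """GPU-hours needed at a given throughput, priced hourly."""
    return tokens / tokens_per_sec / 3600 * gpu_usd_per_hour

one_billion = 1_000_000_000
api = api_cost_usd(one_billion, 0.13)                 # text-embedding-3-large
diy = self_host_cost_usd(one_billion, 115_000, 1.50)  # bge-m3 on an A100 spot
savings_factor = api / diy                            # roughly 36x
```

Plug in your own corpus size, provider pricing, and measured throughput before deciding; the crossover point moves with every one of these inputs.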
Task-Specific Recommendations {#task-specific-recommendations}
Retrieval (RAG, semantic search)
Best quality, proprietary: Voyage-3-large. It outperforms alternatives on most BEIR datasets, and its 32K token context window allows flexible chunking strategies.
Best quality, open-source: bge-en-icl for English-only, bge-m3 for multilingual. Use bge-m3 in hybrid mode (dense + sparse) for the best single-model retrieval quality without a reranker.
Best cost/quality ratio: OpenAI text-embedding-3-small or Jina v3. Both provide 90%+ of the retrieval quality of premium models at a fraction of the cost.
If using a reranker: the embedding model matters less. A mid-tier embedding model + Cohere Rerank v3.5 or a cross-encoder reranker often beats a premium embedding model alone. This is the most important architectural insight for retrieval quality.
Two-stage retrieval: a cheaper embedding model retrieves candidates, a reranker refines the ranking. This is the recommended architecture for production RAG.
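The two-stage shape looks like this in code. A self-contained sketch: dense retrieval over a toy corpus, with a lexical-overlap scorer standing in for a real cross-encoder or API reranker (all data here is synthetic):

```python
import numpy as np

def retrieve_then_rerank(query_vec, doc_vecs, doc_texts, query_text,
                         k_candidates=50, k_final=5):
    """Stage 1: cheap dense retrieval over the whole corpus.
    Stage 2: an expensive scorer over the short candidate list only."""
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
    candidates = np.argsort(-sims)[:k_candidates]

    def rerank_score(text):
        # Stand-in for a cross-encoder forward pass or reranker API call
        q_terms = set(query_text.lower().split())
        return len(q_terms & set(text.lower().split())) / max(len(q_terms), 1)

    reranked = sorted(candidates, key=lambda i: rerank_score(doc_texts[i]),
                      reverse=True)
    return reranked[:k_final]

rng = np.random.default_rng(3)
doc_vecs = rng.normal(size=(100, 16))
doc_texts = [f"document number {i}" for i in range(100)]
doc_texts[7] = "vector databases index embeddings for semantic search"
query_vec = doc_vecs[7] + 0.1 * rng.normal(size=16)

top = retrieve_then_rerank(query_vec, doc_vecs, doc_texts,
                           "vector databases semantic search")
```

The economics work because the expensive scorer only ever sees `k_candidates` documents, not the whole corpus.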
Classification
For text classification via embeddings (embed, then train a linear classifier on top), the models with the highest MTEB classification scores are bge-en-icl and e5-mistral-7b-instruct. The instruction-tuning capability helps because the model can be told to focus on sentiment, topic, or intent.
In practice, text-embedding-3-large with a logistic regression head works surprisingly well and is simpler to deploy. Fine-tuning a smaller model (even a distilled BERT variant) on labeled classification data will beat any general embedding model when labels are available.
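The embed-then-linear-head recipe is a few lines with scikit-learn. Synthetic vectors stand in for real embeddings here; in practice `X` would come from an embedding API:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in for real embeddings: two classes separated in embedding space
rng = np.random.default_rng(4)
pos = rng.normal(loc=0.5, size=(200, 64))   # e.g. positive-sentiment texts
neg = rng.normal(loc=-0.5, size=(200, 64))  # e.g. negative-sentiment texts
X = np.vstack([pos, neg])
y = np.array([1] * 200 + [0] * 200)

# The entire "model" on top of the embeddings is one linear layer
clf = LogisticRegression(max_iter=1000).fit(X, y)
accuracy = clf.score(X, y)
```

The appeal is operational: no GPU at inference time for the classifier, and retraining on new labels takes seconds.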
Clustering
Clustering performance depends on the embedding space geometry. Models that spread different topics across distinct regions of the space produce cleaner clusters. jina-embeddings-v3 with the clustering task adapter and cohere embed-v4.0 with input_type="clustering" are purpose-tuned for this.
For clustering, reducing dimensions to 256 via Matryoshka before running k-means or HDBSCAN can actually improve results — it acts as implicit denoising.
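A sketch of that recipe: truncate, re-normalize, then cluster. Synthetic vectors stand in for real embeddings, and the two planted topics are exaggerated for clarity:

```python
import numpy as np
from sklearn.cluster import KMeans

def truncate_and_normalize(embeddings: np.ndarray, dims: int = 256) -> np.ndarray:
    """Matryoshka-style truncation: keep the first `dims` dimensions,
    then re-normalize so cosine geometry is preserved."""
    cut = embeddings[:, :dims]
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)

rng = np.random.default_rng(5)
# Two synthetic "topics" in a 1024-d embedding space
topic_a = rng.normal(loc=1.0, size=(50, 1024))
topic_b = rng.normal(loc=-1.0, size=(50, 1024))
X = truncate_and_normalize(np.vstack([topic_a, topic_b]), dims=256)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

Re-normalizing after truncation matters here for the same reason it does in retrieval: k-means on unnormalized truncated vectors mixes magnitude into the distances.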
Code Search
voyage-code-3 is the clear recommendation. It handles natural language queries against code repositories, code-to-code similarity, and documentation-to-code linking better than general-purpose models. For open-source alternatives, jina-embeddings-v3 with the code adapter or fine-tuned bge-m3 on code data are reasonable fallbacks.
Practical Integration Patterns {#practical-integration-patterns}
Pattern 1: Batch Embedding with Retry Logic
```python
import time

import numpy as np
from openai import OpenAI
from tenacity import retry, stop_after_attempt, wait_exponential

client = OpenAI()

@retry(wait=wait_exponential(min=1, max=60), stop=stop_after_attempt(5))
def embed_batch(
    texts: list[str],
    model: str = "text-embedding-3-large",
    dimensions: int = 1024,
) -> np.ndarray:
    """Embed a batch of texts with retry logic and dimension truncation."""
    response = client.embeddings.create(
        input=texts,
        model=model,
        dimensions=dimensions,  # Native Matryoshka truncation
    )
    # Response embeddings are in the same order as the input
    return np.array([e.embedding for e in response.data], dtype=np.float32)

def embed_corpus(texts: list[str], batch_size: int = 512) -> np.ndarray:
    """Embed a large corpus in batches."""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        embeddings = embed_batch(batch)
        all_embeddings.append(embeddings)
        # Respect rate limits — OpenAI allows 3M tokens/min for most tiers
        time.sleep(0.1)
    return np.vstack(all_embeddings)
```
Pattern 2: Asymmetric Query/Document Embedding with Cohere
```python
import cohere

co = cohere.ClientV2(api_key="...")

# Document embedding at ingestion time
doc_embeddings = co.embed(
    texts=["Revenue grew 12% in Q3...", "The patent covers..."],
    model="embed-v4.0",
    input_type="search_document",
    embedding_types=["float"],
).embeddings.float

# Query embedding at search time
query_embedding = co.embed(
    texts=["quarterly revenue growth"],
    model="embed-v4.0",
    input_type="search_query",
    embedding_types=["float"],
).embeddings.float
```
The search_query vs search_document distinction is not cosmetic. Cohere applies different internal transformations for each. Mismatching them (embedding queries with search_document type) degrades retrieval quality measurably — typically 3-8% nDCG@10 drop.
Pattern 3: Hybrid Dense+Sparse with BGE-M3
```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

# Generate both dense and sparse representations in one forward pass
output = model.encode(
    ["How do transformers handle long sequences?"],
    return_dense=True,
    return_sparse=True,
)

dense_vector = output["dense_vecs"][0]      # Shape: (1024,)
sparse_dict = output["lexical_weights"][0]  # Dict[token_id, float]

# In your vector DB, store both and combine scores at retrieval time:
#   score = alpha * dense_score + (1 - alpha) * sparse_score
# alpha=0.7 is a reasonable starting point for most retrieval tasks
```
BGE-M3 produces dense and sparse vectors from a single forward pass. Blending both at retrieval time consistently outperforms either alone.
Hybrid retrieval with bge-m3 typically improves nDCG@10 by 3-7% over dense-only retrieval, with the largest gains on out-of-domain queries (where lexical matching catches terms the dense model hasn’t seen in training).
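Score blending needs a normalization step first, because cosine similarities and lexical weights live on different scales. A minimal sketch of the alpha-blend with min-max normalization (the example scores are invented):

```python
import numpy as np

def blend_scores(dense: np.ndarray, sparse: np.ndarray,
                 alpha: float = 0.7) -> np.ndarray:
    """Min-max normalize each score list to [0, 1], then alpha-blend.
    Without normalization, whichever scorer has the larger range dominates."""
    def norm(x):
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)
    return alpha * norm(dense) + (1 - alpha) * norm(sparse)

dense_scores = np.array([0.82, 0.79, 0.40])  # cosine similarities
sparse_scores = np.array([1.2, 7.5, 0.3])    # lexical weights (different scale)
final = blend_scores(dense_scores, sparse_scores)
best = int(np.argmax(final))  # doc 1 wins on the strength of its lexical match
```

Here the document that dense retrieval alone would rank second ends up first, which is exactly the out-of-domain rescue effect described above.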
Pattern 4: Self-Hosted Embedding Server with vLLM
For decoder-based embedding models like e5-mistral-7b-instruct:
# vLLM supports embedding mode for compatible models
python -m vllm.entrypoints.openai.api_server \
--model intfloat/e5-mistral-7b-instruct \
--task embed \
--dtype float16 \
--max-model-len 8192 \
--port 8000
Then call it with any OpenAI-compatible client:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.embeddings.create(
    input="Represent this sentence for retrieval: How do vector databases work?",
    model="intfloat/e5-mistral-7b-instruct",
)
```
For encoder-based models (bge-m3, jina-v3), TEI (Text Embeddings Inference by Hugging Face) is more appropriate:
```bash
docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-embeddings-inference:latest \
  --model-id BAAI/bge-m3 \
  --dtype float16 \
  --max-batch-tokens 65536
```
TEI supports dynamic batching, which is critical for maintaining throughput under variable load. It will accumulate incoming requests up to --max-batch-tokens and process them in a single forward pass.
Common Failure Modes {#common-failure-modes}
1. Mixing Embedding Models Across Index and Query
This sounds obvious but happens regularly in production, especially during model migrations. Embeddings from different models are not compatible — they occupy entirely different vector spaces. A cosine similarity of 0.95 between vectors from text-embedding-3-small and text-embedding-3-large is meaningless.
During model migration, re-embed the entire corpus with the new model. There is no shortcut. Shadow-index the new embeddings alongside the old ones, A/B test retrieval quality, then cut over.
2. Not Normalizing After Matryoshka Truncation
As mentioned earlier, truncating a Matryoshka embedding and then computing cosine similarity without re-normalizing produces incorrect rankings. Most vector databases handle normalization internally when using cosine distance, but if computing similarity manually (or using dot product distance), normalize explicitly.
3. Exceeding Max Token Limits Silently
Most embedding APIs silently truncate inputs that exceed the token limit rather than raising an error. OpenAI’s models truncate at 8,191 tokens. Cohere truncates at 512. If the important content of a document is in the second half and the model truncates from the end, the embedding captures none of it.
Fix: chunk documents before embedding. Use a tokenizer to measure length:
```python
import tiktoken

enc = tiktoken.encoding_for_model("text-embedding-3-large")

def safe_chunk(text: str, max_tokens: int = 8000, overlap: int = 200) -> list[str]:
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        chunks.append(enc.decode(tokens[start:end]))
        if end == len(tokens):
            break  # Done; stepping back by `overlap` here would loop forever
        start = end - overlap
    return chunks
```
4. Ignoring the Query/Document Asymmetry
For models that support it (Cohere, Voyage, instruction-tuned models), embedding queries and documents with the same instruction/input_type degrades retrieval. Queries are short and express information needs. Documents are longer and contain information. The model adjusts its representation accordingly.
Asymmetric encoding: queries and documents take different paths through the model (or receive different instructions), but the output vectors live in the same space for comparison.
5. Embedding Stale Content Without Re-indexing
Embedding vectors don’t update themselves when the underlying content changes. If a product description is edited, the old embedding is now a lie. Production systems need a mechanism to detect content changes and trigger re-embedding. This is an ETL problem, not a model problem, but it’s the most common source of “why did search quality degrade?” complaints.
6. Averaging Embeddings as a Shortcut
A common temptation: embed each chunk of a long document separately, then average the vectors to get a “document embedding.” This works poorly. Averaging dilutes distinctive information and pushes the resulting vector toward the centroid of the embedding space, reducing its discriminative power. Prefer either: (a) embed chunks separately and retrieve at the chunk level, or (b) use a model with a long context window and embed the full document.
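The dilution is easy to demonstrate: the mean of near-orthogonal unit vectors collapses toward the origin. Synthetic chunk vectors below:

```python
import numpy as np

rng = np.random.default_rng(6)
# Chunk embeddings for one document spanning several distinct topics,
# modeled as random (near-orthogonal) unit vectors
chunks = rng.normal(size=(8, 128))
chunks /= np.linalg.norm(chunks, axis=1, keepdims=True)

chunk_norm = float(np.linalg.norm(chunks[0]))  # each chunk: unit length
doc_avg = chunks.mean(axis=0)
avg_norm = float(np.linalg.norm(doc_avg))
# The averaged "document vector" has expected norm ~ 1/sqrt(8) ≈ 0.35:
# most of the per-chunk signal cancels out, pushing it toward the centroid
```

Real chunk embeddings from one document are more correlated than random vectors, so the collapse is less extreme, but the principle holds: the average preserves what the chunks share and discards what distinguishes them.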
Summary {#summary}
Model selection depends on the task, budget, and infrastructure constraints. For most RAG systems, a mid-tier embedding model (OpenAI text-embedding-3-small, Jina v3, or bge-m3) paired with a reranker outperforms a premium embedding model used alone. For maximum retrieval quality without a reranker, Voyage-3-large is the current leader.
Matryoshka embeddings are underused. Truncating to 256 dimensions sacrifices ~5% quality for 12x storage reduction. At scale, this changes the infrastructure requirements entirely.
Hybrid retrieval (dense + sparse) consistently beats dense-only. bge-m3 makes this easy with a single model. Alternatively, combine any dense model with BM25.
Self-hosting makes economic sense above ~10M embeddings/month for teams with existing GPU infrastructure. Below that, API pricing is simpler and cheaper when accounting for operational overhead.
Always evaluate on your own data. MTEB scores are directional. A model that ranks #1 on MTEB may rank #3 on your specific corpus and query distribution. Build an evaluation set of 100-500 representative queries with known-relevant documents and measure nDCG@10 before committing.
Further Reading {#further-reading}
- MTEB Leaderboard — The canonical embedding model benchmark, updated as new models are submitted
- BAAI/bge-m3 on Hugging Face — Model card with usage examples for dense, sparse, and ColBERT retrieval modes
- Matryoshka Representation Learning (arXiv:2205.13147) — The original paper explaining how MRL training works and its theoretical properties
- Text Embeddings Inference (TEI) — Hugging Face’s optimized serving runtime for encoder-based embedding models
- Voyage AI Documentation — API reference and best practices for Voyage embedding models, including the code-specific variant
- Jina Embeddings v3 Technical Report — Architecture details on task LoRA adapters and late interaction support
- BEIR Benchmark — The retrieval-focused subset of evaluation datasets used widely for comparing embedding models
- Improving Text Embeddings with Large Language Models (arXiv:2401.00368) — The E5-Mistral paper showing decoder-only LLMs can produce competitive embeddings
- Cohere Embed v4 Documentation — Input type parameters, multilingual usage, and integration with Cohere Rerank