Embeddings in Practice: Every Major Model Compared


Every major embedding model compared: dimensions, MTEB scores, pricing, latency, and concrete guidance on which to use for retrieval, classification, and clustering in 2026.


Text embeddings convert variable-length strings into fixed-length numerical vectors that capture semantic meaning. They are the backbone of retrieval-augmented generation, semantic search, classification, clustering, anomaly detection, and recommendation systems. The choice of embedding model affects retrieval quality, latency, cost, and storage requirements in ways that are often underappreciated until something breaks in production.

The embedding landscape in early 2026 looks different from even a year ago. OpenAI has iterated past text-embedding-ada-002. Cohere has shipped Embed v4. Voyage AI (acquired by Anthropic in early 2025) has tightened its integration with the Claude ecosystem. Open-source models from BAAI (BGE) and Jina AI have closed the gap with proprietary options on MTEB benchmarks. Picking the right model now requires navigating a matrix of tradeoffs that didn’t exist when everyone just defaulted to Ada.


What Embeddings Actually Represent {#what-embeddings-actually-represent}

An embedding model maps text to a point in high-dimensional space where geometric distance corresponds to semantic similarity. Two sentences about the same concept end up near each other; unrelated sentences end up far apart. The specific geometry depends on the training objective, but the practical result is that cosine similarity between embedding vectors serves as a proxy for semantic relatedness.
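That geometric intuition is easy to make concrete. A minimal sketch with toy 3-d vectors standing in for real model output (real embeddings have hundreds or thousands of dimensions, but the math is identical):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings": related concepts point in similar directions
cat = np.array([0.9, 0.1, 0.0])
kitten = np.array([0.8, 0.2, 0.1])
invoice = np.array([0.0, 0.1, 0.9])

assert cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice)
```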

Input Text → Tokenizer → Transformer Encoder → Pooling Layer → Vector (e.g., 1536-d)

The embedding pipeline: text is tokenized, processed through a transformer encoder, then pooled into a single fixed-length vector.

The pooling layer matters more than most people realize. Early models used [CLS] token pooling (taking the output of the special classification token). Most modern embedding models use mean pooling over all token representations, which tends to produce better results for retrieval tasks. Some models (notably Jina’s) use more sophisticated attention-weighted pooling.

The output vector’s dimensionality is a design choice. Higher dimensions can capture more nuanced distinctions but cost more to store and compare. OpenAI’s text-embedding-3-large outputs 3072 dimensions. Cohere’s Embed v4 defaults to 1024. The relationship between dimension count and quality is sublinear — going from 256 to 1024 dimensions helps a lot; going from 1024 to 3072 helps less.

How Embedding Models Are Trained {#how-embedding-models-are-trained}

Understanding training helps explain why certain models work better for certain tasks.

Most modern embedding models use a two-stage training process:

Stage 1: Contrastive pre-training. The model learns from large-scale weakly supervised pairs — titles and bodies of web pages, questions and answers from forums, query-document pairs from search logs. The training objective pulls matching pairs together and pushes non-matching pairs apart in the embedding space. This is typically done with InfoNCE loss or a variant.

Stage 2: Fine-tuning with hard negatives. The model is refined on curated datasets with carefully selected hard negatives — documents that are superficially similar but not actually relevant. This stage teaches the model to make finer-grained distinctions.

Pretrained LM → Contrastive Training (weak pairs, InfoNCE) → Hard-Negative Fine-tuning → Task-Specific Instruction Tuning → Final Model

Two-stage training with optional instruction tuning. Models like E5-Mistral and BGE-en-icl add instruction-following capabilities in a third stage.

Some newer models add a third stage: instruction tuning for embeddings. Models like E5-Mistral-7B-instruct and BGE’s bge-en-icl accept a task-specific instruction prefix (e.g., “Represent this document for retrieval” vs. “Represent this sentence for classification”). This lets a single model optimize its embedding space for different downstream tasks. In practice, using the right instruction prefix can improve retrieval nDCG@10 by 2-5 points on MTEB tasks.
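The instruction prefix is plain string concatenation before encoding. A sketch in the style of e5-mistral-7b-instruct's documented template (the exact format is model-specific; always check the model card):

```python
def with_instruction(task: str, text: str) -> str:
    """Prefix format used by e5-mistral-style instruction-tuned embedders."""
    return f"Instruct: {task}\nQuery: {text}"

# Queries get the instruction; documents are typically embedded without one
retrieval_input = with_instruction(
    "Given a web search query, retrieve relevant passages",
    "how do heat pumps work",
)
```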

The MTEB Benchmark: What It Measures and What It Misses {#the-mteb-benchmark}

The Massive Text Embedding Benchmark (MTEB) is the standard leaderboard for embedding models. It covers 8 task categories across 58+ datasets (as of the latest MTEB v2 revision):

| Task Category | Example Datasets | Metric |
| --- | --- | --- |
| Retrieval | MS MARCO, NQ, BEIR suite | nDCG@10 |
| Semantic Textual Similarity | STS Benchmark, SICK-R | Spearman correlation |
| Classification | Amazon Reviews, Tweet Sentiment | Accuracy |
| Clustering | Reddit, ArXiv | V-measure |
| Pair Classification | Twitter URL Paraphrase | AP |
| Reranking | AskUbuntu, StackOverflow | MAP |
| Summarization | SummEval | Spearman correlation |
| Bitext Mining | Tatoeba | F1 |

What MTEB gets right: it evaluates across diverse tasks, so a model can’t game a single metric. The retrieval subset (based on BEIR) is particularly well-designed, covering domain-specific corpora like scientific papers, financial documents, and biomedical text.

What MTEB misses:

  • Long-document performance. Most MTEB retrieval datasets have short passages (100-300 tokens). Models that excel on MTEB may degrade on 2000+ token chunks. MTEB has added some long-document tasks, but coverage is thin.
  • Cross-lingual nuance. The multilingual track exists but is less comprehensive than the English track. Performance on low-resource languages is underrepresented.
  • Domain-specific retrieval. BEIR covers some domains, but production use cases in legal, medical, or codebase search have distributional properties that MTEB doesn’t fully capture.
  • Embedding + reranker pipeline quality. MTEB evaluates embeddings in isolation. In production, a mediocre embedding model paired with a strong reranker (Cohere Rerank, cross-encoder) can outperform a better embedding model used alone.

Use MTEB scores as a starting signal, not as a final answer. Always evaluate on a representative sample of production data.
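The evaluation itself is small enough to hand-roll. A minimal nDCG@10 sketch over binary relevance judgments (your own query and document IDs would go here):

```python
import math

def ndcg_at_10(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    """nDCG@10 with binary gains: rewards relevant documents near the top."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:10])
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), 10)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

# A perfect ranking scores 1.0; demoting a relevant doc scores less
assert ndcg_at_10(["d1", "d2"], {"d1", "d2"}) == 1.0
```

Averaging this over 100-500 representative queries gives a corpus-specific number you can compare across candidate models.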

Model-by-Model Comparison {#model-by-model-comparison}

The Full Landscape

| Model | Provider | Max Tokens | Default Dims | MTEB Avg (English) | Matryoshka | Multilingual | Release |
| --- | --- | --- | --- | --- | --- | --- | --- |
| text-embedding-3-large | OpenAI | 8,191 | 3,072 | 64.6 | Yes (256–3072) | 20+ langs | Jan 2024 |
| text-embedding-3-small | OpenAI | 8,191 | 1,536 | 62.3 | Yes (256–1536) | 20+ langs | Jan 2024 |
| embed-v4.0 | Cohere | 512 | 1,024 | 66.2* | Yes (256–1024) | 100+ langs | Q4 2025 |
| voyage-3-large | Voyage AI | 32,000 | 1,024 | 67.1* | Yes (256–1024) | 30+ langs | Q3 2025 |
| voyage-3 | Voyage AI | 32,000 | 1,024 | 65.4 | Yes | 30+ langs | Mid 2025 |
| voyage-3-lite | Voyage AI | 32,000 | 512 | 62.8 | No | 30+ langs | Mid 2025 |
| voyage-code-3 | Voyage AI | 32,000 | 1,024 | — | No | Code-focused | Q3 2025 |
| jina-embeddings-v3 | Jina AI | 8,192 | 1,024 | 65.5 | Yes (64–1024) | 30+ langs | Late 2024 |
| bge-en-icl | BAAI | 8,192 | 4,096 | 66.5 | No | English-only | Mid 2024 |
| bge-m3 | BAAI | 8,192 | 1,024 | 63.5 | No | 100+ langs | Early 2024 |
| e5-mistral-7b-instruct | Microsoft | 32,768 | 4,096 | 66.6 | No | English-primary | Late 2023 |
| nomic-embed-text-v2-moe | Nomic AI | 8,192 | 768 | 63.1 | Yes (64–768) | English-primary | Early 2025 |
| mxbai-embed-large-v1 | Mixedbread | 512 | 1,024 | 64.7 | Yes | English-primary | Early 2024 |

*Scores marked with asterisks are from provider-reported evaluations and may use slightly different MTEB configurations. Independent reproductions sometimes show 0.5-1.5 point differences.

OpenAI: text-embedding-3-large / 3-small

OpenAI’s embedding models are the most widely deployed, mostly because of ecosystem inertia. text-embedding-3-large is a solid all-rounder. Its main advantages: wide availability, simple API, and native Matryoshka support that allows truncating the 3072-dimensional output to 256 or 1024 dimensions with graceful quality degradation.

The main drawback is that these models are no longer state-of-the-art on retrieval benchmarks. Voyage-3-large and Cohere Embed v4 outperform them on most BEIR datasets, particularly for domain-specific retrieval. The 8,191 token context window is adequate for typical RAG chunks but limiting for whole-document embedding.

Cohere: Embed v4

Cohere’s Embed v4 introduced input type parameters (search_document, search_query, classification, clustering) that adjust the embedding space for the target task. This is similar in spirit to instruction-tuned embeddings but exposed as a simpler API parameter.

Multilingual coverage is Cohere’s strongest selling point — 100+ languages with a single model, and the quality on non-English retrieval tasks is notably better than OpenAI’s multilingual performance. The 512-token input limit is a real constraint, though. Anything over 512 tokens gets truncated, which means chunking strategy matters more with Cohere than with models that accept longer inputs.

Cohere also bundles a reranker (Rerank v3.5) that pairs well with Embed v4 for two-stage retrieval. The combined pipeline often outperforms using a more expensive embedding model alone.

Voyage AI: voyage-3-large

Voyage AI, now part of the Anthropic ecosystem, produces what is probably the best proprietary embedding model as of March 2026 for English retrieval tasks. voyage-3-large tops or ties for first on most BEIR retrieval benchmarks.

The 32,000-token context window is a differentiator. It enables embedding entire documents without chunking, which is useful for document-level similarity and clustering tasks. For RAG retrieval, longer context doesn’t always help (embedding a 10,000-token document into a single vector dilutes the representation of any individual passage), but for certain workflows — legal document comparison, patent search, academic paper similarity — it’s valuable.

voyage-code-3 is purpose-built for code search and retrieval. It handles code-natural language queries better than general-purpose models, with particularly strong performance on CodeSearchNet and custom code retrieval benchmarks.

  • voyage-3-lite: 512-d, budget
  • voyage-3: 1024-d, general
  • voyage-3-large: 1024-d, best quality
  • voyage-code-3: 1024-d, code-specific

Voyage AI’s model lineup: three tiers for general text plus a code-specific variant.

Jina AI: jina-embeddings-v3

Jina’s v3 model stands out for its flexibility. It supports Matryoshka dimensions from 64 to 1024, task-specific LoRA adapters (retrieval, classification, text-matching, code), and late interaction output (ColBERT-style multi-vector representations) from the same model.

The late interaction option is particularly interesting. Instead of compressing a document into a single vector, the model outputs one vector per token. Retrieval uses MaxSim — the maximum similarity between each query token vector and all document token vectors, summed across query tokens. This preserves more information than single-vector retrieval and is especially effective for queries with multiple distinct concepts.
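MaxSim itself is a few lines of numpy. A sketch with random stand-in token vectors (real multi-vector output would come from the model; shapes and seed here are illustrative):

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """ColBERT-style late interaction: for each query token, take its best
    match among document tokens, then sum over query tokens."""
    sims = query_vecs @ doc_vecs.T        # (n_query, n_doc) similarity matrix
    return float(sims.max(axis=1).sum())  # best document match per query token

rng = np.random.default_rng(0)
query = rng.normal(size=(4, 64))    # 4 query token vectors, 64-d each
doc = rng.normal(size=(120, 64))    # 120 document token vectors
score = maxsim_score(query, doc)
```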

The tradeoff: late interaction requires storing N vectors per document (where N is the token count), which increases storage by 100-500x compared to single-vector approaches. Jina positions this as an alternative to a separate reranking step.

Jina’s models are available both as a hosted API and as open-weight downloads, making them a good option for teams that need to run embeddings on-premise.

BAAI BGE: bge-en-icl and bge-m3

BGE (Beijing Academy of Artificial Intelligence General Embedding) models are the open-source heavyweight. bge-en-icl uses in-context learning — it accepts few-shot examples in the prompt to adapt its embedding behavior for specific tasks at inference time. This flexibility is powerful but adds complexity: the quality of the in-context examples directly affects output quality.

bge-m3 supports three retrieval modes from a single model: dense (single vector), sparse (learned sparse representations like SPLADE), and ColBERT-style multi-vector. This multi-modality makes it a strong choice for hybrid search pipelines that combine dense and sparse retrieval.

Both models run locally on commodity hardware. bge-m3 fits on a single consumer GPU (it’s ~567M parameters). bge-en-icl is larger (~7B parameters) and needs a GPU with at least 16GB VRAM for fp16 inference.

Microsoft E5-Mistral-7B-Instruct

Built on Mistral-7B, this was the first model to demonstrate that decoder-only LLMs could produce competitive embeddings (previously, encoder-only architectures like BERT dominated). It accepts task-specific instructions and performs near the top of MTEB, particularly on retrieval tasks.

The main drawbacks: 7B parameters means it’s expensive to self-host (needs a capable GPU), and inference latency is higher than encoder-based models. At ~150ms per embedding on an A100, it’s 5-10x slower than bge-m3 or Jina v3. For batch offline embedding, this is fine. For real-time query embedding in a search pipeline, it may be too slow without optimization.

Nomic: nomic-embed-text-v2-moe

Nomic’s v2 uses a mixture-of-experts architecture to keep the active parameter count low while maintaining quality. At 768 dimensions, it’s compact. Matryoshka support down to 64 dimensions makes it viable for extremely storage-constrained applications. The model is fully open-source (Apache 2.0) and runs efficiently on CPUs.

The quality is a step below the leaders on MTEB retrieval but competitive on classification and clustering tasks. For teams that need an open, lightweight model and don’t require top-tier retrieval accuracy, it’s a reasonable choice.

Dimensions, Storage, and the Matryoshka Trick {#dimensions-storage-and-matryoshka}

Matryoshka Representation Learning (MRL) trains embedding models so that the first d dimensions of the full vector form a valid (if lower-quality) embedding on their own. Named after Russian nesting dolls, this technique lets a single model serve multiple quality/cost tradeoffs.

Full Vector (3072-d) → Truncate to 1024-d → Truncate to 256-d → Truncate to 64-d

Matryoshka embeddings: truncate to any prefix length. Quality degrades gracefully — 256-d retains ~95% of full retrieval quality for many tasks.

The storage math matters at scale:

| Dimensions | Bytes per Vector (float32) | Bytes per Vector (int8 quantized) | 10M Vectors (float32) | 10M Vectors (int8) |
| --- | --- | --- | --- | --- |
| 3,072 | 12,288 | 3,072 | ~115 GB | ~29 GB |
| 1,024 | 4,096 | 1,024 | ~38 GB | ~9.5 GB |
| 512 | 2,048 | 512 | ~19 GB | ~4.8 GB |
| 256 | 1,024 | 256 | ~9.5 GB | ~2.4 GB |

Going from 3072 to 256 dimensions reduces storage by 12x. Combined with int8 quantization, the reduction is 48x. For a 100M-document corpus, this is the difference between needing a cluster and fitting on a single machine.
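The storage figures follow from simple arithmetic, which you can rerun for your own corpus size (note the table reads as binary GiB: 10M float32 vectors at 3072-d is 122.88 decimal GB, or ~114 GiB):

```python
def index_size_gb(n_vectors: int, dims: int, bytes_per_dim: int = 4) -> float:
    """Raw vector storage in decimal GB (float32 by default; pass 1 for int8)."""
    return n_vectors * dims * bytes_per_dim / 1e9

full = index_size_gb(10_000_000, 3072)     # 122.88 GB decimal (~114 GiB)
small = index_size_gb(10_000_000, 256, 1)  # 2.56 GB: truncated + int8 quantized
```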

The retrieval quality impact of dimension reduction varies by model and task:

# Example: Using Matryoshka dimensions with OpenAI
import numpy as np
from openai import OpenAI

client = OpenAI()

# Full 3072 dimensions
full = client.embeddings.create(
    input="quarterly earnings report analysis",
    model="text-embedding-3-large"
).data[0].embedding  # len: 3072

# Truncated to 256 — slice, then convert to an array so normalization works
truncated = np.array(full[:256], dtype=np.float32)

# Normalize after truncation (important!)
truncated = truncated / np.linalg.norm(truncated)

Critical detail: after truncating Matryoshka embeddings, re-normalize the vector to unit length. Without normalization, cosine similarity computations will be incorrect.

Empirically, on MS MARCO passage retrieval:

  • text-embedding-3-large at 3072-d: nDCG@10 of ~0.445
  • Same model at 1024-d: ~0.438 (1.6% drop)
  • Same model at 256-d: ~0.421 (5.4% drop)

A 5% retrieval quality drop in exchange for 12x storage reduction is a worthwhile tradeoff for many applications. For high-stakes retrieval (legal, medical), it probably isn’t.

Latency and Throughput Benchmarks {#latency-and-throughput-benchmarks}

Embedding latency matters in two contexts: query-time (embedding a user query for real-time search, where p99 latency affects UX) and ingestion-time (embedding a corpus of documents, where throughput in tokens/second affects pipeline cost).

API-Based Models

Measured from US-East, March 2026, single-request latency for a 256-token input:

| Model | Median Latency | p99 Latency | Batch Throughput (tokens/sec) |
| --- | --- | --- | --- |
| OpenAI text-embedding-3-small | 45ms | 120ms | ~1.2M |
| OpenAI text-embedding-3-large | 65ms | 180ms | ~800K |
| Cohere embed-v4.0 | 55ms | 150ms | ~900K |
| Voyage-3 | 70ms | 200ms | ~600K |
| Voyage-3-large | 95ms | 280ms | ~400K |

Batch throughput figures assume using each provider’s batch API with maximum parallelism. OpenAI’s batch embedding API (/v1/embeddings with arrays of inputs) is the most efficient, accepting up to 2048 inputs per request.

Self-Hosted Models

Measured on a single A100-80GB, batch size 64, 256-token inputs:

| Model | Parameters | Throughput (embeddings/sec) | Throughput (tokens/sec) |
| --- | --- | --- | --- |
| bge-m3 | 567M | ~450 | ~115K |
| jina-embeddings-v3 | ~570M | ~420 | ~107K |
| nomic-embed-text-v2-moe | ~400M active | ~550 | ~140K |
| e5-mistral-7b-instruct | 7B | ~60 | ~15K |
| bge-en-icl | ~7B | ~55 | ~14K |

The 7B parameter models are 7-8x slower than the sub-1B models. For large-scale ingestion (millions of documents), this difference translates directly into GPU-hours and cost. Running bge-m3 on an A100 costs roughly $2/hour on spot instances, which yields ~1.6M embeddings/hour at 256 tokens each. e5-mistral-7b-instruct on the same hardware yields ~216K embeddings/hour.

  • Sub-1B models (bge-m3, jina-v3): ~450 emb/sec
  • 7B models (E5-Mistral, bge-en-icl): ~60 emb/sec
  • API models (OpenAI, Cohere): network-bound

Throughput tiers: sub-1B encoder models dominate for self-hosted batch workloads. API models shift the bottleneck to network and rate limits.

Pricing Comparison {#pricing-comparison}

As of March 2026:

| Model | Price per 1M Tokens | Price per 1K Embeddings (256 tok avg) | Notes |
| --- | --- | --- | --- |
| OpenAI text-embedding-3-small | $0.02 | $0.005 | Cheapest proprietary option |
| OpenAI text-embedding-3-large | $0.13 | $0.033 | |
| Cohere embed-v4.0 | $0.10 | $0.025 | Free tier: 1K requests/month |
| Voyage-3-lite | $0.02 | $0.005 | |
| Voyage-3 | $0.06 | $0.015 | |
| Voyage-3-large | $0.18 | $0.046 | |
| Jina v3 (API) | $0.02 | $0.005 | Also available as open-weight |
| Self-hosted bge-m3 (A100 spot) | ~$0.004* | ~$0.001* | *Compute cost only, excludes ops |

The self-hosting breakeven: at OpenAI’s text-embedding-3-large pricing of $0.13/1M tokens, embedding 1B tokens costs $130. Running bge-m3 on an A100 spot instance at $1.50/hour processes ~115K tokens/sec, so 1B tokens takes ~2.4 hours = ~$3.60. Self-hosting is 36x cheaper at this scale, but that ignores infrastructure overhead (provisioning, monitoring, model serving, autoscaling). For teams already running GPU infrastructure, self-hosting embeddings is almost always more economical above ~10M embeddings/month.
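Spelled out as arithmetic (using the spot price and throughput estimates above, which will vary with region and hardware):

```python
tokens = 1_000_000_000                 # 1B tokens to embed
api_cost = tokens / 1_000_000 * 0.13   # text-embedding-3-large: $130
hours = tokens / 115_000 / 3600        # bge-m3 at ~115K tokens/sec: ~2.4 hours
gpu_cost = hours * 1.50                # A100 spot at ~$1.50/hour: ~$3.60
savings = api_cost / gpu_cost          # roughly 36x
```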

Task-Specific Recommendations {#task-specific-recommendations}

Retrieval

Best quality, proprietary: Voyage-3-large. It outperforms alternatives on most BEIR datasets, and its 32K token context window allows flexible chunking strategies.

Best quality, open-source: bge-en-icl for English-only, bge-m3 for multilingual. Use bge-m3 in hybrid mode (dense + sparse) for the best single-model retrieval quality without a reranker.

Best cost/quality ratio: OpenAI text-embedding-3-small or Jina v3. Both provide 90%+ of the retrieval quality of premium models at a fraction of the cost.

If using a reranker: the embedding model matters less. A mid-tier embedding model + Cohere Rerank v3.5 or a cross-encoder reranker often beats a premium embedding model alone. This is the most important architectural insight for retrieval quality.

Query → Embedding Model (fast, cheap) → Vector DB → Top-100 candidates → Reranker (slow, accurate) → Top-10 Results

Two-stage retrieval: a cheaper embedding model retrieves candidates, a reranker refines the ranking. This is the recommended architecture for production RAG.
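The two stages compose simply. A sketch where `embed`, `vector_db.search`, and `rerank` are placeholder callables, not a specific SDK:

```python
def two_stage_search(query: str, embed, vector_db, rerank, k: int = 10):
    """Stage 1: cheap ANN recall of many candidates.
    Stage 2: accurate reranking of that small set only."""
    query_vec = embed(query)                          # fast embedding model
    candidates = vector_db.search(query_vec, top_k=100)
    return rerank(query, candidates)[:k]              # expensive model sees 100 docs, not millions
```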

Classification

For text classification via embeddings (embed, then train a linear classifier on top), the models with the highest MTEB classification scores are bge-en-icl and e5-mistral-7b-instruct. The instruction-tuning capability helps because the model can be told to focus on sentiment, topic, or intent.

In practice, text-embedding-3-large with a logistic regression head works surprisingly well and is simpler to deploy. Fine-tuning a smaller model (even a distilled BERT variant) on labeled classification data will beat any general embedding model when labels are available.
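The "linear head on frozen embeddings" recipe is a few lines end to end. A self-contained numpy sketch with toy 2-d vectors standing in for real embeddings (in practice you would use scikit-learn's LogisticRegression on actual model output):

```python
import numpy as np

def train_logistic_head(X: np.ndarray, y: np.ndarray,
                        epochs: int = 500, lr: float = 0.5):
    """Binary logistic regression trained on frozen embedding vectors."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid probabilities
        grad = p - y                            # gradient of log loss w.r.t. logits
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

# Toy "embeddings": two linearly separable clusters
X = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
y = np.array([0, 0, 1, 1])
w, b = train_logistic_head(X, y)
preds = ((X @ w + b) > 0).astype(int)
```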

Clustering

Clustering performance depends on the embedding space geometry. Models that spread different topics across distinct regions of the space produce cleaner clusters. jina-embeddings-v3 with the clustering task adapter and cohere embed-v4.0 with input_type="clustering" are purpose-tuned for this.

For clustering, reducing dimensions to 256 via Matryoshka before running k-means or HDBSCAN can actually improve results — it acts as implicit denoising.
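The truncation step is a slice plus row-wise re-normalization over the whole matrix. A numpy sketch (random vectors stand in for real embeddings; the output feeds straight into k-means or HDBSCAN):

```python
import numpy as np

def truncate_for_clustering(embeddings: np.ndarray, dims: int = 256) -> np.ndarray:
    """Keep the Matryoshka prefix and re-normalize each row to unit length."""
    prefix = embeddings[:, :dims].astype(np.float32)
    norms = np.linalg.norm(prefix, axis=1, keepdims=True)
    return prefix / np.clip(norms, 1e-12, None)  # guard against zero rows

vectors = np.random.default_rng(1).normal(size=(1000, 1024)).astype(np.float32)
reduced = truncate_for_clustering(vectors)  # shape: (1000, 256), unit-norm rows
```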

Code Search

voyage-code-3 is the clear recommendation. It handles natural language queries against code repositories, code-to-code similarity, and documentation-to-code linking better than general-purpose models. For open-source alternatives, jina-embeddings-v3 with the code adapter or fine-tuned bge-m3 on code data are reasonable fallbacks.

Practical Integration Patterns {#practical-integration-patterns}

Pattern 1: Batch Embedding with Retry Logic

import time
import numpy as np
from openai import OpenAI
from tenacity import retry, wait_exponential, stop_after_attempt

client = OpenAI()

@retry(wait=wait_exponential(min=1, max=60), stop=stop_after_attempt(5))
def embed_batch(texts: list[str], model: str = "text-embedding-3-large", dimensions: int = 1024) -> np.ndarray:
    """Embed a batch of texts with retry logic and dimension truncation."""
    response = client.embeddings.create(
        input=texts,
        model=model,
        dimensions=dimensions  # Native Matryoshka truncation
    )
    # Response embeddings are in the same order as input
    return np.array([e.embedding for e in response.data], dtype=np.float32)


def embed_corpus(texts: list[str], batch_size: int = 512) -> np.ndarray:
    """Embed a large corpus in batches."""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        embeddings = embed_batch(batch)
        all_embeddings.append(embeddings)
        # Respect rate limits — OpenAI allows 3M tokens/min for most tiers
        time.sleep(0.1)
    return np.vstack(all_embeddings)

Pattern 2: Asymmetric Query/Document Embedding with Cohere

import cohere

co = cohere.ClientV2(api_key="...")

# Document embedding at ingestion time
doc_embeddings = co.embed(
    texts=["Revenue grew 12% in Q3...", "The patent covers..."],
    model="embed-v4.0",
    input_type="search_document",
    embedding_types=["float"]
).embeddings.float

# Query embedding at search time
query_embedding = co.embed(
    texts=["quarterly revenue growth"],
    model="embed-v4.0",
    input_type="search_query",
    embedding_types=["float"]
).embeddings.float

The search_query vs search_document distinction is not cosmetic. Cohere applies different internal transformations for each. Mismatching them (embedding queries with search_document type) degrades retrieval quality measurably — typically 3-8% nDCG@10 drop.

Pattern 3: Hybrid Dense+Sparse with BGE-M3

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

# Generate both dense and sparse representations
output = model.encode(
    ["How do transformers handle long sequences?"],
    return_dense=True,
    return_sparse=True
)

dense_vector = output["dense_vecs"][0]      # Shape: (1024,)
sparse_dict = output["lexical_weights"][0]  # Dict[token_id, float]

# In your vector DB, store both and combine scores at retrieval time
# Score = alpha * dense_score + (1 - alpha) * sparse_score
# alpha=0.7 is a reasonable starting point for most retrieval tasks

Input Text → BGE-M3 (single model) → Dense Vector (1024-d) + Sparse Vector (lexical weights) → Hybrid Score (α=0.7 blend)

BGE-M3 produces dense and sparse vectors from a single forward pass. Blending both at retrieval time consistently outperforms either alone.

Hybrid retrieval with bge-m3 typically improves nDCG@10 by 3-7% over dense-only retrieval, with the largest gains on out-of-domain queries (where lexical matching catches terms the dense model hasn’t seen in training).

Pattern 4: Self-Hosted Embedding Server with vLLM

For decoder-based embedding models like e5-mistral-7b-instruct:

# vLLM supports embedding mode for compatible models
python -m vllm.entrypoints.openai.api_server \
    --model intfloat/e5-mistral-7b-instruct \
    --task embed \
    --dtype float16 \
    --max-model-len 8192 \
    --port 8000

Then call it with any OpenAI-compatible client:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.embeddings.create(
    input="Represent this sentence for retrieval: How do vector databases work?",
    model="intfloat/e5-mistral-7b-instruct"
)

For encoder-based models (bge-m3, jina-v3), TEI (Text Embeddings Inference by Hugging Face) is more appropriate:

docker run --gpus all -p 8080:80 \
    ghcr.io/huggingface/text-embeddings-inference:latest \
    --model-id BAAI/bge-m3 \
    --dtype float16 \
    --max-batch-tokens 65536

TEI supports dynamic batching, which is critical for maintaining throughput under variable load. It will accumulate incoming requests up to --max-batch-tokens and process them in a single forward pass.

Common Failure Modes {#common-failure-modes}

1. Mixing Embedding Models Across Index and Query

This sounds obvious but happens regularly in production, especially during model migrations. Embeddings from different models are not compatible — they occupy entirely different vector spaces. A cosine similarity of 0.95 between vectors from text-embedding-3-small and text-embedding-3-large is meaningless.

During model migration, re-embed the entire corpus with the new model. There is no shortcut. Shadow-index the new embeddings alongside the old ones, A/B test retrieval quality, then cut over.

2. Not Normalizing After Matryoshka Truncation

As mentioned earlier, truncating a Matryoshka embedding and then computing cosine similarity without re-normalizing produces incorrect rankings. Most vector databases handle normalization internally when using cosine distance, but if computing similarity manually (or using dot product distance), normalize explicitly.

3. Exceeding Max Token Limits Silently

Most embedding APIs silently truncate inputs that exceed the token limit rather than raising an error. OpenAI’s models truncate at 8,191 tokens. Cohere truncates at 512. If the important content of a document is in the second half and the model truncates from the end, the embedding captures none of it.

Fix: chunk documents before embedding. Use a tokenizer to measure length:

import tiktoken

enc = tiktoken.encoding_for_model("text-embedding-3-large")

def safe_chunk(text: str, max_tokens: int = 8000, overlap: int = 200) -> list[str]:
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        chunks.append(enc.decode(tokens[start:end]))
        if end == len(tokens):
            break  # done; without this, the overlap step would re-process the tail forever
        start = end - overlap
    return chunks

4. Ignoring the Query/Document Asymmetry

For models that support it (Cohere, Voyage, instruction-tuned models), embedding queries and documents with the same instruction/input_type degrades retrieval. Queries are short and express information needs. Documents are longer and contain information. The model adjusts its representation accordingly.

Query (short, intent-focused) → Query Encoder Path → Shared Vector Space ← Doc Encoder Path ← Document (long, info-dense)

Asymmetric encoding: queries and documents take different paths through the model (or receive different instructions), but the output vectors live in the same space for comparison.

5. Embedding Stale Content Without Re-indexing

Embedding vectors don’t update themselves when the underlying content changes. If a product description is edited, the old embedding is now a lie. Production systems need a mechanism to detect content changes and trigger re-embedding. This is an ETL problem, not a model problem, but it’s the most common source of “why did search quality degrade?” complaints.

6. Averaging Embeddings as a Shortcut

A common temptation: embed each chunk of a long document separately, then average the vectors to get a “document embedding.” This works poorly. Averaging dilutes distinctive information and pushes the resulting vector toward the centroid of the embedding space, reducing its discriminative power. Prefer either: (a) embed chunks separately and retrieve at the chunk level, or (b) use a model with a long context window and embed the full document.
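The dilution is easy to see numerically: averaging unit vectors that point in different directions yields a much shorter vector near the centroid. A numpy sketch (random near-orthogonal vectors stand in for chunk embeddings):

```python
import numpy as np

rng = np.random.default_rng(2)
chunks = rng.normal(size=(8, 256))
chunks /= np.linalg.norm(chunks, axis=1, keepdims=True)  # 8 unit chunk embeddings

doc = chunks.mean(axis=0)
# Averaging near-orthogonal unit vectors shrinks the norm toward zero,
# so the "document vector" loses most of each chunk's distinctive signal
assert np.linalg.norm(doc) < 0.6
```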

Summary {#summary}

Model selection depends on the task, budget, and infrastructure constraints. For most RAG systems, a mid-tier embedding model (OpenAI text-embedding-3-small, Jina v3, or bge-m3) paired with a reranker outperforms a premium embedding model used alone. For maximum retrieval quality without a reranker, Voyage-3-large is the current leader.

Matryoshka embeddings are underused. Truncating to 256 dimensions sacrifices ~5% quality for 12x storage reduction. At scale, this changes the infrastructure requirements entirely.

Hybrid retrieval (dense + sparse) consistently beats dense-only. bge-m3 makes this easy with a single model. Alternatively, combine any dense model with BM25.

Self-hosting makes economic sense above ~10M embeddings/month for teams with existing GPU infrastructure. Below that, API pricing is simpler and cheaper when accounting for operational overhead.

Always evaluate on your own data. MTEB scores are directional. A model that ranks #1 on MTEB may rank #3 on your specific corpus and query distribution. Build an evaluation set of 100-500 representative queries with known-relevant documents and measure nDCG@10 before committing.
