Vector databases power every RAG pipeline and semantic search system in production today, but the gap between “spin up a hosted index” and “understand why recall dropped 12% after scaling to 50M vectors” is where most teams get stuck.
Table of Contents
- The Core Problem: Nearest Neighbor Search
- Distance Metrics: Choosing the Right Similarity Function
- HNSW: The Algorithm Behind Most Vector Databases
- IVF: Inverted File Indexes
- Product Quantization: Trading Precision for Memory
- Combining Techniques: IVF-PQ, HNSW-PQ, and Beyond
- Flat Search and Brute Force: When Simplicity Wins
- Filtering: The Hard Part Nobody Warns You About
- Database Comparison: Pinecone, Weaviate, Qdrant, pgvector, and Chroma
- Architecture Patterns in Production
- Benchmarking Your Own Workload
- Summary
- Further Reading
The Core Problem: Nearest Neighbor Search
Every vector database solves one problem: given a query vector of d dimensions, find the k most similar vectors from a collection of n vectors. Exact nearest neighbor search requires comparing the query against every vector in the dataset — O(n × d) per query. At 1M vectors with 1536 dimensions (OpenAI’s text-embedding-3-small), that’s ~6 billion floating-point operations per query. Manageable on a single machine with brute force. At 100M vectors, it’s not.
Approximate Nearest Neighbor (ANN) algorithms trade a small amount of recall for orders-of-magnitude speedup. A well-tuned HNSW index at 100M vectors returns results in under 10ms with 95-99% recall — meaning 95-99% of the true nearest neighbors appear in the result set.
Brute force guarantees perfect recall but scales linearly; ANN indexes sacrifice a small amount of accuracy for sublinear query times.
The entire field of vector databases is about the data structures and algorithms that make ANN practical at scale, plus the operational concerns — filtering, updates, persistence, replication — that turn an index into a database.
Distance Metrics: Choosing the Right Similarity Function
The choice of distance metric affects both correctness and index construction. The three that matter:
Cosine similarity measures the angle between vectors, ignoring magnitude. Most embedding models (OpenAI, Cohere, Voyage, BGE) normalize their outputs to unit length, which means cosine similarity reduces to a dot product. If your vectors are already L2-normalized, cosine and dot product are equivalent — use dot product, it’s one fewer operation.
Euclidean distance (L2) measures straight-line distance. For normalized vectors, minimizing L2 distance produces the same ranking as maximizing cosine similarity: ||a - b||² = 2 - 2·cos(a, b) when ||a|| = ||b|| = 1. For non-normalized vectors (some custom models, learned representations from autoencoders), L2 and cosine diverge.
Inner product (dot product) is equivalent to cosine when vectors are normalized. When they’re not, it weights both direction and magnitude — useful when magnitude carries information (e.g., a popularity-weighted embedding where more important items have longer vectors).
| Metric | When to Use | Normalized Equivalent |
|---|---|---|
| Cosine similarity | Default for most embedding models | Same as dot product |
| Euclidean (L2) | Non-normalized vectors, spatial data | Same ranking as cosine when normalized |
| Dot product | Pre-normalized vectors, max-inner-product search | Same as cosine when normalized |
Practical rule: if using any standard embedding API (OpenAI, Cohere, Voyage, Jina, BGE), pick dot product or cosine — they’ll produce identical results because the API returns normalized vectors. Check your model’s documentation. Voyage v3 and Jina v3 both normalize by default. OpenAI’s text-embedding-3-large normalizes by default, and its dimensions parameter truncates the vector and then re-normalizes it, so the shortened vectors remain unit-length.
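A quick numpy check makes the equivalence concrete (a toy sketch with random vectors, not tied to any database): for unit-length vectors, cosine, dot product, and squared L2 distance all carry the same ranking information.

import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=1536)
b = rng.normal(size=1536)
a /= np.linalg.norm(a)  # L2-normalize, as most embedding APIs already do
b /= np.linalg.norm(b)

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
dot = a @ b
l2_squared = np.sum((a - b) ** 2)

print(cosine, dot)              # identical for unit vectors
print(l2_squared, 2 - 2 * dot)  # ||a - b||^2 = 2 - 2*cos(a, b)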
HNSW: The Algorithm Behind Most Vector Databases
Hierarchical Navigable Small World (HNSW) is the default index type in Qdrant, Weaviate, pgvector (since v0.5.0), Chroma, and the index behind most Pinecone pod configurations. Understanding it well is the single highest-leverage thing for tuning vector search performance.
The Intuition
HNSW builds a multi-layer graph. The bottom layer (layer 0) contains every vector as a node, connected to its approximate nearest neighbors. Upper layers contain progressively fewer nodes — each node at layer l has a probability of appearing at layer l+1, typically 1/M where M is a parameter.
Search starts at the top layer, greedily navigating to the closest node to the query, then drops down a layer and repeats. The top layers provide “express lanes” across the graph; the bottom layer provides fine-grained local search.
Search enters at the top layer and greedily descends, narrowing the search scope at each level.
The Two Parameters That Matter
M (max connections per node): Each node in layer 0 connects to up to 2*M neighbors; nodes in upper layers connect to up to M neighbors. Higher M means better recall and higher memory usage. Default in most implementations: 16. At M=16, each vector stores up to 32 neighbor links in layer 0 plus links on any upper layers it appears in, which works out to several hundred bytes to ~1.5KB of graph overhead per vector depending on the implementation. At 10M vectors, that’s roughly 10-15GB of graph structure alone, on top of the raw vector data.
ef_construction (beam width during build): How many candidates the algorithm tracks while inserting a new node. Higher values produce a better graph but take longer to build. Default: 128-200. Diminishing returns above 400 for most workloads.
ef_search (beam width during query): The runtime equivalent. This is the primary latency/recall knob. At ef_search=50, queries are fast but recall might be 92%. At ef_search=200, recall hits 99% but latency doubles or triples. This is the parameter to tune at query time — it requires no index rebuild.
| Parameter | Default | Tuning Impact | Rebuild Required? |
|---|---|---|---|
| M | 16 | Memory ↑, recall ↑, build time ↑ | Yes |
| ef_construction | 128-200 | Build quality ↑, build time ↑ | Yes |
| ef_search | 100-150 | Recall ↑, latency ↑ | No |
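To see how these knobs map onto an actual API, here is a minimal sketch using hnswlib, the standalone library several engines build on; the data is random and the parameter values are illustrative, not recommendations.

import hnswlib
import numpy as np

dim, n = 1536, 100_000
data = np.random.rand(n, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
# Build-time parameters: changing M or ef_construction means rebuilding
index.init_index(max_elements=n, M=16, ef_construction=128)
index.add_items(data, np.arange(n))

# Query-time parameter: tune freely without touching the index
index.set_ef(100)  # ef_search
labels, distances = index.knn_query(data[:5], k=10)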
Memory Footprint
For 1M vectors at 1536 dimensions (float32):
- Raw vectors: 1M × 1536 × 4 bytes = ~5.7 GB
- HNSW graph (M=16): ~1-1.5 GB
- Total: ~7-7.2 GB
This is why HNSW is often called “memory-hungry.” The graph overhead is relatively modest, but the requirement to hold all vectors in memory (or memory-mapped storage) means costs scale linearly with collection size. At 100M vectors with 1536-d float32, that’s ~570 GB of raw vector data before graph overhead.
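The arithmetic is worth keeping as a quick estimator. This rough sketch assumes float32 vectors and a ballpark per-vector graph overhead for M=16; real engines add allocator and metadata overhead on top.

def hnsw_memory_gb(n_vectors, dim, bytes_per_component=4, graph_bytes_per_vector=1_400):
    # Raw vector storage plus approximate HNSW graph overhead
    raw = n_vectors * dim * bytes_per_component
    graph = n_vectors * graph_bytes_per_vector
    return (raw + graph) / 1024**3

print(hnsw_memory_gb(1_000_000, 1536))    # ~7 GB
print(hnsw_memory_gb(100_000_000, 1536))  # ~700 GB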
Build Time
HNSW insertion is O(log n) per vector (traversing layers) times O(M × ef_construction) work per layer. For 10M 1536-d vectors on a modern 32-core machine, expect 30-90 minutes for a full index build. Incremental insertions are fast — single vectors insert in <1ms — but inserting many vectors into a pre-built graph produces a slightly worse graph than building from scratch. Most databases handle this transparently with background optimization passes.
IVF: Inverted File Indexes
IVF (Inverted File Index) partitions the vector space into nlist clusters using k-means, then assigns each vector to its nearest cluster centroid. At query time, only the nprobe closest clusters are searched.
IVF narrows the search space by comparing the query to cluster centroids, then only searching the most promising partitions.
Key Parameters
nlist (number of clusters): Typically sqrt(n) to 4*sqrt(n). For 1M vectors, nlist=1000 to 4000. Too few clusters → each cluster is large, search is slow. Too many → cluster boundaries cut through dense regions, hurting recall.
nprobe (clusters searched at query time): The recall/latency knob. nprobe=1 is fastest, nprobe=nlist is equivalent to brute force. Typical range: 5-50. At nprobe=10 with nlist=1024, you’re searching ~1% of the data.
IVF’s advantage over HNSW: lower memory overhead (no graph structure, just centroids and inverted lists) and faster index construction. Its disadvantage: generally lower recall at equivalent query latency. IVF also requires an upfront training step (k-means clustering), which is expensive for large datasets — O(nlist × n × d × iterations).
IVF is the backbone of FAISS (Meta’s library) and shows up as an option in most databases. It’s particularly useful when combined with quantization.
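A minimal FAISS sketch of the nlist/nprobe workflow, with random data standing in for real embeddings and normalization so that inner product behaves like cosine:

import faiss
import numpy as np

d, n = 1536, 200_000
xb = np.random.rand(n, d).astype(np.float32)
faiss.normalize_L2(xb)  # unit-length vectors: inner product == cosine

nlist = 1024
quantizer = faiss.IndexFlatIP(d)  # used to find the nearest centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)

index.train(xb)   # k-means clustering: the expensive upfront step
index.add(xb)

index.nprobe = 10  # search ~1% of the clusters per query
scores, ids = index.search(xb[:5], k=10)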
Product Quantization: Trading Precision for Memory
Product Quantization (PQ) compresses vectors by splitting each d-dimensional vector into m subvectors, then replacing each subvector with the index of its nearest centroid from a learned codebook. Instead of storing 1536 float32 values (6144 bytes), PQ with m=96 subquantizers and 8-bit codes stores 96 bytes per vector — a 64x compression.
Product quantization splits vectors into subvectors and replaces each with a codebook index, achieving 64x compression at the cost of distance approximation accuracy.
How Distance Computation Works with PQ
At query time, the database precomputes a lookup table: for each of the m subquantizers, it calculates the distance from the query’s subvector to all 256 centroids (for 8-bit codes). This produces an m × 256 table. Distance to any database vector is then approximated by summing m table lookups — extremely fast, especially with SIMD.
The approximation introduces error: PQ distances are lossy estimates of the true distances. Reaching 95% recall with PQ typically requires 2-3x over-retrieval followed by reranking with exact distances (stored separately or recomputed from the original vectors).
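The lookup-table trick is easy to see in plain numpy. This sketch assumes the codebooks have already been trained (k-means per subspace) and that codes holds each vector’s uint8 centroid indices; it illustrates the idea, not FAISS’s optimized implementation.

import numpy as np

def pq_adc_distances(query, codebooks, codes):
    """Asymmetric distance computation against PQ-encoded vectors.

    query:     (d,) float32
    codebooks: (m, 256, d // m) float32 -- 256 centroids per subspace
    codes:     (n, m) uint8 -- one centroid index per subvector, per vector
    """
    m, n_centroids, d_sub = codebooks.shape
    q_sub = query.reshape(m, d_sub)

    # Precompute the m x 256 table of squared distances, once per query
    table = ((codebooks - q_sub[:, None, :]) ** 2).sum(axis=-1)  # (m, 256)

    # Approximate distance to every database vector = sum of m table lookups
    return table[np.arange(m), codes].sum(axis=1)  # (n,)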
Scalar Quantization
A simpler alternative: scalar quantization (SQ) reduces each float32 to int8, yielding a 4x compression with much less distortion than PQ. Qdrant and Weaviate both support SQ (pgvector’s closest equivalent is the half-precision halfvec type). At 1536 dimensions, SQ reduces per-vector storage from 6144 bytes to 1536 bytes. The recall loss is typically <1% for well-distributed embeddings.
Binary quantization goes further — 1 bit per dimension. A 1536-d vector becomes 192 bytes. Useful for first-pass candidate retrieval with reranking, but standalone recall drops to 70-85% for typical embedding distributions.
| Method | Compression | Memory (1M × 1536d) | Typical Recall Loss |
|---|---|---|---|
| None (float32) | 1x | 5.7 GB | 0% |
| Scalar (int8) | 4x | 1.4 GB | <1% |
| Product Quantization | 32-64x | 90-180 MB | 3-8% without reranking |
| Binary | 32x | ~185 MB | 15-30% without reranking |
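Binary quantization is simple enough to sketch directly: keep only the sign of each dimension and compare packed codes with Hamming distance. A toy illustration, not any particular database’s implementation:

import numpy as np

def binary_quantize(vectors):
    # 1 bit per dimension, packed 8 dimensions per byte: 1536-d -> 192 bytes
    return np.packbits(vectors > 0, axis=-1)

def hamming(code_a, code_b):
    # Number of differing bits between two packed codes
    return int(np.unpackbits(code_a ^ code_b).sum())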
Combining Techniques: IVF-PQ, HNSW-PQ, and Beyond
Real-world indexes combine algorithms. The most common compositions:
IVF-PQ: IVF for coarse partitioning, PQ for compressed storage within each partition. This is FAISS’s bread-and-butter index (faiss.IndexIVFPQ). At 1B vectors, IVF-PQ can fit in ~100GB RAM with reasonable recall. The workflow: find the nearest nprobe centroids, scan PQ codes within those clusters, rerank top candidates with exact vectors.
HNSW + SQ/PQ: Use HNSW graph navigation with quantized vectors for distance computation during search, then optionally rerank with full-precision vectors. Qdrant calls this “quantization” in their configuration; Weaviate supports PQ compression on HNSW indexes.
IVF-HNSW: Use HNSW to find the nearest centroids (replacing the brute-force centroid search in standard IVF). Useful when nlist is very large (>100k).
Composite indexes use coarse search to find candidate partitions, PQ for fast approximate scoring, and optional reranking with full-precision vectors for final accuracy.
The practical takeaway: if your dataset fits in memory with float32 or int8 vectors, pure HNSW with scalar quantization is the simplest path. When memory is the constraint — datasets exceeding available RAM — IVF-PQ or HNSW+PQ becomes necessary.
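A FAISS sketch of the IVF-PQ-plus-rerank workflow described above; parameters are illustrative, and m must divide the dimension evenly (1536 / 96 = 16 dims per subquantizer).

import faiss
import numpy as np

d, n = 1536, 200_000
xb = np.random.rand(n, d).astype(np.float32)

nlist, m, nbits = 1024, 96, 8  # 96 subquantizers x 8-bit codes = 96 bytes/vector
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
index.train(xb)
index.add(xb)
index.nprobe = 16

# Over-retrieve with the compressed index, then rerank with exact distances
q = np.random.rand(1, d).astype(np.float32)
_, candidate_ids = index.search(q, k=100)
candidates = xb[candidate_ids[0]]  # original float32 vectors kept for reranking
exact = np.linalg.norm(candidates - q, axis=1)
top10 = candidate_ids[0][np.argsort(exact)[:10]]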
Flat Search and Brute Force: When Simplicity Wins
For collections under ~100k vectors, brute-force search with SIMD-optimized distance computation often outperforms ANN indexes. A 50k × 1536 float32 matrix fits in ~290 MB. Brute-force scan with AVX-512 on a modern CPU completes in <5ms. The recall is 100%, there’s no index to build or tune, and insertions/deletions are trivial.
pgvector’s default behavior without an index is brute-force scan. For small collections, this is the correct choice. Adding an HNSW index to a 10k-row table adds build time, memory, and configuration surface area with negligible latency improvement.
The crossover point depends on dimensionality and hardware, but as a rough guide:
| Collection Size | Recommended Approach |
|---|---|
| <50k vectors | Brute force (no index) |
| 50k-1M vectors | HNSW with scalar quantization |
| 1M-100M vectors | HNSW + SQ, or IVF-PQ |
| >100M vectors | IVF-PQ with HNSW coarse quantizer, or distributed sharding |
Filtering: The Hard Part Nobody Warns You About
Most real queries aren’t pure vector search. They include metadata filters: “find similar documents where tenant_id = ‘acme’ and created_after = ‘2025-01-01’.” This is where vector databases diverge sharply in capability and where performance often collapses.
Three Filtering Strategies
Pre-filtering: Apply metadata filters first, then search only matching vectors. Guarantees that all results match the filter. Problem: if the filter is very selective (matches 0.1% of vectors), the ANN index can’t be used efficiently — the graph or cluster structure was built over the full dataset, and the filter-matching vectors might be scattered across the graph.
Post-filtering: Run ANN search first, retrieve more candidates than needed, then filter. Fast, but if the filter is selective, most candidates are discarded. At 0.1% filter selectivity, retrieving top-10 results might require scanning top-10,000 before finding 10 matches.
Integrated filtering: Modify the ANN search to respect filters during traversal. HNSW graph traversal skips non-matching nodes. This is the approach Qdrant and Weaviate use. It preserves index efficiency but can degrade recall when filters are very selective — the graph’s connectivity assumptions break down when most nodes are invisible.
Each filtering strategy has a failure mode — the choice depends on filter selectivity and acceptable recall loss.
The Selectivity Cliff
When a metadata filter matches less than ~1-5% of the dataset, all three strategies struggle. Pre-filtering produces a tiny search set where ANN indexes add no value. Post-filtering wastes compute. Integrated filtering navigates a sparse graph.
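The arithmetic behind the cliff is straightforward for post-filtering: the expected number of ANN candidates you must retrieve scales with 1/selectivity (a rough model that ignores any correlation between the filter and vector similarity).

import math

def candidates_needed(k, selectivity):
    # Expected ANN candidates to over-retrieve so that ~k survive the filter
    return math.ceil(k / selectivity)

for sel in (0.5, 0.1, 0.01, 0.001):
    print(f"selectivity {sel}: retrieve ~{candidates_needed(10, sel):,} candidates for top-10")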
The practical solution for highly selective filters: maintain separate indexes per partition. If tenant_id produces natural partitions of 10k-1M vectors each, building per-tenant indexes avoids the selectivity problem entirely. Pinecone’s “namespaces” and Qdrant’s “payload indexes” with partitioning address this. Weaviate added multi-tenancy with per-tenant isolation in 2024, which maps well to this pattern.
Database Comparison: Pinecone, Weaviate, Qdrant, pgvector, and Chroma
Pinecone
Architecture: Fully managed, serverless and pod-based tiers. The serverless tier (launched 2024) charges per read/write/storage with no idle cost. Pod-based tier (p1, p2, s1 pods) offers more predictable performance.
Index type: Proprietary. Pinecone hasn’t publicly documented whether they use HNSW, IVF, or something custom in serverless. Pod-based likely uses a modified HNSW. The “sparse-dense” hybrid search suggests separate sparse and dense indexes with score fusion.
Filtering: Integrated metadata filtering. Supports exact match, range, and $in operators. Performance degrades on highly selective filters but is better than naive post-filtering.
Strengths: Zero operational overhead. The serverless pricing model works well for bursty workloads. Sparse-dense hybrid search is built in.
Weaknesses: No self-hosted option. No access to index parameters (M, ef_search) — Pinecone manages these. Vendor lock-in is real. Debugging recall issues requires support tickets, not parameter tuning.
Pricing (serverless, as of early 2026): ~$0.033/1M read units, $2.00/1M write units, $0.033/GB/month storage. A 10M-vector collection with 1536-d vectors costs roughly $20-40/month in storage plus per-query costs.
Qdrant
Architecture: Written in Rust. Runs as a single binary or distributed cluster. Open-source (Apache 2.0) with a managed cloud offering.
Index type: HNSW with configurable M and ef_construction. Supports scalar quantization (int8), product quantization, and binary quantization. Quantization can be configured per-collection.
Filtering: Payload indexes (analogous to secondary indexes in traditional databases) with integrated filtering during HNSW traversal. Supports numeric, keyword, geo, datetime, and full-text payload types. The filtering implementation is notably well-engineered — Qdrant maintains a payload index that integrates with the HNSW traversal, avoiding the worst-case selectivity problems.
Strengths: Best-in-class filtering performance. Fine-grained tuning (M, ef, quantization). Multi-tenancy support. The gRPC API is fast. Snapshot-based backups.
Weaknesses: Distributed mode requires manual shard management (improved in recent versions but not as turnkey as Pinecone). Memory-mapped storage helps with large datasets but introduces I/O latency.
Pricing (cloud): Starts at ~$25/month for a small node. Self-hosted: free.
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, FieldCondition, Filter, HnswConfigDiff, MatchValue,
    ScalarQuantization, ScalarQuantizationConfig, ScalarType, SearchParams,
    VectorParams,
)

client = QdrantClient("localhost", port=6333)

# Create collection with scalar quantization
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(type=ScalarType.INT8, quantile=0.99, always_ram=True)
    ),
    hnsw_config=HnswConfigDiff(m=16, ef_construct=128),
)

# Search with metadata filter
results = client.query_points(
    collection_name="documents",
    query=[0.1] * 1536,  # query vector
    query_filter=Filter(must=[FieldCondition(key="tenant_id", match=MatchValue(value="acme"))]),
    limit=10,
    search_params=SearchParams(hnsw_ef=128),  # runtime ef_search
)
Weaviate
Architecture: Written in Go. Modular: separate vectorizer modules can call OpenAI, Cohere, etc. at ingest time. Open-source (BSD-3) with a managed cloud.
Index type: HNSW (custom implementation). Added PQ compression in 2024. Flat index option for small collections.
Filtering: Inverted index on properties combined with HNSW search. Weaviate builds a roaring bitmap per filterable property and intersects it with the HNSW candidate set. Pre-filtering is the default; it switches to post-filtering when the filter is too selective and pre-filtering would be too slow.
Strengths: Built-in vectorization (no need to manage embedding API calls separately). GraphQL and REST APIs. Multi-tenancy is well-implemented — each tenant gets isolated storage and index resources. Hybrid search (BM25 + vector) is native.
Weaknesses: Go’s garbage collector can cause latency spikes under heavy load (measurably worse p99 than Qdrant’s Rust-based engine at high QPS). The module system adds complexity. Memory consumption tends to be higher than Qdrant for equivalent workloads.
Pricing (cloud): Serverless tier starts free with pay-per-use. Dedicated clusters from ~$25/month.
pgvector
Architecture: PostgreSQL extension. Vectors are stored as a column type. Indexes are built using standard CREATE INDEX syntax.
Index type: HNSW (added in v0.5.0, mid-2023) and IVFFlat. HNSW is the recommended choice for most workloads. As of pgvector 0.8.x (current), reduced-precision storage is available via the halfvec (float16) and bit (binary) vector types rather than true int8 scalar quantization.
Filtering: This is pgvector’s killer feature. Metadata filtering goes through PostgreSQL’s query planner — WHERE tenant_id = 'acme' combined with ORDER BY embedding <=> query_vector LIMIT 10 lets the planner choose between a B-tree scan on the filter column (with exact distance ordering) and an HNSW scan with the filter applied to its candidates. Partial indexes work: CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops) WHERE tenant_id = 'acme'.
Strengths: No new infrastructure. If you’re already running PostgreSQL, pgvector adds vector search with a CREATE EXTENSION command. Full SQL expressiveness for filtering. Transactional consistency with your application data. Partial indexes for per-tenant isolation.
Weaknesses: Performance ceiling. pgvector’s HNSW index builds were single-threaded until parallel builds landed in 0.6.0, and builds remain slower than Qdrant or FAISS. At >5M vectors, query latency and memory usage become painful compared to purpose-built engines. No built-in replication of vector indexes across read replicas (the index must be built on each replica). No native quantization beyond halfvec (half-precision float16) and bit types.
Pricing: Whatever your PostgreSQL host costs. On AWS RDS, a db.r7g.xlarge (32 GB RAM) runs ~$350/month and comfortably handles 2-5M 1536-d vectors.
-- pgvector: create table, index, and query with filtering
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE documents (
    id bigserial PRIMARY KEY,
    tenant_id text NOT NULL,
    content text,
    embedding vector(1536),
    created_at timestamptz DEFAULT now()
);
-- B-tree index for filtering
CREATE INDEX idx_tenant ON documents (tenant_id);
-- HNSW index with custom parameters
CREATE INDEX idx_embedding ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 128);
-- Query: combined metadata filter + vector search
SET hnsw.ef_search = 100;
SELECT id, content, embedding <=> $1::vector AS distance
FROM documents
WHERE tenant_id = 'acme' AND created_at > '2025-01-01'
ORDER BY embedding <=> $1::vector
LIMIT 10;
Chroma
Architecture: Python-native. Uses hnswlib under the hood for vector indexing and DuckDB+Parquet or SQLite for metadata. Open-source (Apache 2.0). Designed as an “AI-native” embedding database.
Index type: HNSW via hnswlib. Limited configurability — M and ef_construction are exposed but the defaults are reasonable for small-to-medium collections.
Filtering: Metadata filtering via where clauses. Implemented as post-filtering on the HNSW results. At selective filters, this means poor recall or requiring large overretrieval.
Strengths: Simplest possible API. Five lines to ingest and query. Excellent for prototyping, local development, and small-scale production (<1M vectors). In-process mode (no server needed) is genuinely useful for notebook workflows and testing.
Weaknesses: Not designed for large-scale production. No built-in clustering, replication, or sharding. Filtering is post-filter only, which breaks at scale. hnswlib’s index is fully in-memory with no memory-mapped option. The Python process becomes the bottleneck at high QPS.
Pricing: Free (open-source). Chroma Cloud (hosted, in GA as of late 2025) has a free tier and paid plans starting at ~$30/month.
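The claim that a handful of lines suffices to ingest and query is roughly literal. A minimal in-process sketch, assuming you supply your own precomputed embeddings rather than one of Chroma’s embedding functions:

import chromadb

client = chromadb.Client()  # in-process, no server required
collection = client.create_collection("documents")
collection.add(
    ids=["doc-1", "doc-2"],
    embeddings=[[0.1] * 1536, [0.2] * 1536],
    metadatas=[{"tenant_id": "acme"}, {"tenant_id": "globex"}],
)
results = collection.query(
    query_embeddings=[[0.1] * 1536],
    n_results=2,
    where={"tenant_id": "acme"},  # applied as a filter on the HNSW results
)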
Comparison Table
| Feature | Pinecone | Qdrant | Weaviate | pgvector | Chroma |
|---|---|---|---|---|---|
| Language | Proprietary | Rust | Go | C (PG ext) | Python |
| Self-hosted | No | Yes | Yes | Yes | Yes |
| Max practical scale | Billions | 100M+ | 100M+ | ~5-10M | ~1M |
| HNSW tuning | No | Full | Limited | Full | Limited |
| Quantization | Unknown | SQ, PQ, Binary | PQ, BQ | halfvec, bit | None |
| Filtering quality | Good | Excellent | Good | Excellent (SQL) | Poor |
| Hybrid search | Yes (sparse-dense) | Yes (sparse vectors) | Yes (BM25 + vector) | Manual (tsvector + vector) | No |
| Multi-tenancy | Namespaces | Payload partitions | Native tenant isolation | Partial indexes | Collections |
| Managed cloud | Yes (only) | Yes | Yes | RDS/Aurora/Supabase | Yes |
| Minimum cost | ~$0/mo (serverless idle) | $25/mo (cloud) | Free tier (cloud) | Existing PG cost | Free |
Database choice depends primarily on scale and whether you’re already running PostgreSQL.
Architecture Patterns in Production
Pattern 1: pgvector for Everything Under 5M Vectors
If the application already uses PostgreSQL and the vector collection stays under 5M rows, pgvector eliminates an entire infrastructure dependency. Vectors live alongside relational data. JOINs work. Transactions are ACID. The developer experience is “just SQL.”
The simplest production architecture: PostgreSQL handles both application data and vector search.
The failure mode: query latency creeps up as the collection grows. HNSW index build times (15-45 minutes for 5M 1536-d vectors) block vacuum operations. At this point, extract vector search into a dedicated engine.
Pattern 2: Dedicated Vector DB + Application DB
The standard pattern at scale. Application data stays in PostgreSQL/MySQL. Vectors and associated metadata live in Qdrant, Weaviate, or Pinecone. The application writes to both on ingest and queries the vector DB for retrieval.
Separating vector search from the application database introduces consistency challenges but enables independent scaling.
The consistency challenge: if a document is updated in the application DB but the vector DB hasn’t been re-indexed yet, search returns stale results. Solutions: (1) synchronous dual-write (slow, fragile), (2) change data capture (CDC) pipeline from the application DB to trigger re-embedding and vector upsert, or (3) accept eventual consistency with a bounded lag (the most common choice).
Pattern 3: Hybrid Search with Reranking
Pure vector search misses exact keyword matches. Pure keyword search misses semantic similarity. Hybrid search runs both and merges results.
Hybrid search combines keyword and vector retrieval, then uses a reranker to produce the final ranking.
Reciprocal Rank Fusion (RRF) is the simplest merge strategy: score = Σ 1/(k + rank_i) across result lists, where k is typically 60. It works well without tuning. Weighted linear combination requires calibrating the relative weights of sparse vs. dense scores, which vary by query type.
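RRF itself is only a few lines of code. A sketch assuming each input list contains document IDs already sorted by that retriever’s own ranking, best first:

from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked ID lists: score(doc) = sum of 1 / (k + rank) across lists."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ids = ["d3", "d1", "d7"]    # keyword retriever, e.g. top results from BM25
vector_ids = ["d1", "d9", "d3"]  # dense retriever, e.g. top results from HNSW
print(reciprocal_rank_fusion([bm25_ids, vector_ids]))  # ['d1', 'd3', 'd9', 'd7']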
The reranker matters more than the merge strategy. A cross-encoder reranker (Cohere Rerank v3, Jina Reranker v2, or a fine-tuned bge-reranker-v2-m3) evaluates query-document pairs jointly. Rerankers are expensive — 10-50ms per candidate pair — so they’re applied to the merged top-50 or top-100 candidates.
Weaviate supports hybrid search natively (alpha parameter controls BM25 vs. vector weighting). Pinecone supports sparse-dense vectors in the same index. For Qdrant and pgvector, hybrid search requires running two queries and merging in application code.
Pattern 4: Tiered Storage for Cost Optimization
At 100M+ vectors, keeping everything in memory-backed HNSW is expensive. A tiered approach: hot data (recent, frequently accessed) in a fast HNSW index, cold data in an IVF-PQ or disk-backed index.
Qdrant supports on-disk storage with memory-mapped files. Queries to memory-mapped indexes are ~2-5x slower than in-memory but cost 10x less (disk vs. RAM). The configurable on_disk flag per collection and memmap_threshold parameter control when data moves to disk.
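A sketch of that configuration with the Qdrant Python client; the values are illustrative, and the memmap_threshold unit (kilobytes of segment size) is worth double-checking against the current Qdrant docs.

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, OptimizersConfigDiff, VectorParams

client = QdrantClient("localhost", port=6333)
client.create_collection(
    collection_name="documents_cold",
    # Keep original vectors on disk (memory-mapped) instead of RAM
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE, on_disk=True),
    # Segments above this threshold are converted to memory-mapped storage
    optimizers_config=OptimizersConfigDiff(memmap_threshold=20000),
)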
Benchmarking Your Own Workload
Public benchmarks (ANN Benchmarks, VectorDBBench) test synthetic workloads with uniform distributions. Real embeddings from production data have different clustering properties, and real queries include filters. Running benchmarks on your own data with your own query patterns is non-negotiable before choosing a database.
A Practical Benchmarking Protocol
- Sample 100k-1M vectors from your actual data (or all of it if smaller). Include representative metadata fields.
- Generate ground truth: brute-force top-100 for 1000 random query vectors. This is your recall baseline.
- Measure recall@k: For each database/configuration, run the same 1000 queries and compute |intersection(ann_results, ground_truth)| / k. Measure at k=10 and k=100.
- Measure with filters: Repeat with metadata filters at 50%, 10%, 1%, and 0.1% selectivity. This is where databases diverge most.
- Measure under load: Run queries at 10, 50, 100, 500 QPS. Measure p50, p95, p99 latency. Purpose-built databases (Qdrant, Weaviate) should sustain <20ms p95 at 100 QPS for 1M vectors; pgvector will likely show higher p99.
- Measure index build time and memory: Track RSS memory and wall-clock build time. These determine infrastructure cost.
import numpy as np

def measure_recall(ground_truth_ids, ann_result_ids, k):
    """Compute recall@k: fraction of true top-k neighbors found by ANN."""
    recalls = []
    for gt, ann in zip(ground_truth_ids, ann_result_ids):
        gt_set = set(gt[:k])
        ann_set = set(ann[:k])
        recalls.append(len(gt_set & ann_set) / k)
    return np.mean(recalls)

# Example: brute-force ground truth with numpy
def brute_force_topk(query_vectors, db_vectors, k=100):
    """Compute exact top-k for recall measurement."""
    # For cosine similarity with normalized vectors, use dot product
    similarities = query_vectors @ db_vectors.T  # (n_queries, n_db)
    top_k_indices = np.argpartition(-similarities, kth=k, axis=1)[:, :k]
    # Sort the top-k by similarity
    for i in range(len(query_vectors)):
        order = np.argsort(-similarities[i, top_k_indices[i]])
        top_k_indices[i] = top_k_indices[i, order]
    return top_k_indices
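For the load-testing step, latency percentiles need nothing more than perf_counter. A sequential sketch, where run_query stands in for whatever client call you are benchmarking; sustained-QPS testing additionally needs a concurrent load generator.

import numpy as np
from time import perf_counter

def measure_latency_percentiles(run_query, queries):
    """Run each query once and report p50/p95/p99 latency in milliseconds."""
    latencies_ms = []
    for q in queries:
        start = perf_counter()
        run_query(q)
        latencies_ms.append((perf_counter() - start) * 1000)
    return np.percentile(latencies_ms, [50, 95, 99])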
What “Good” Looks Like
For a RAG system feeding results to an LLM: recall@10 ≥ 0.95 is the floor. Below that, the LLM frequently misses relevant context. p95 query latency under 50ms keeps the retrieval step from dominating the total response time (LLM inference is 500-3000ms).
For recommendation systems with user-facing latency requirements: p99 < 20ms at the expected QPS, with recall@50 ≥ 0.90 (the downstream ranking model compensates for retrieval misses).
Summary
HNSW is the default index for most workloads. Tune ef_search at runtime for recall/latency tradeoff; tune M and ef_construction at build time for recall/memory tradeoff.
IVF-PQ is for scale beyond available RAM. 64x vector compression makes billion-vector search feasible on commodity hardware, at the cost of recall that requires reranking to compensate.
Scalar quantization (int8) is the lowest-risk compression — <1% recall loss, 4x memory savings. Use it by default on any collection over 1M vectors.
Filtering is the most common source of production issues. Test with realistic filter selectivity. Per-tenant partitioning (separate indexes or namespaces) avoids the selectivity cliff.
pgvector is the right answer when you have <5M vectors and already run PostgreSQL. The operational simplicity outweighs the performance gap for most applications at this scale.
Qdrant has the strongest combination of performance, tunability, and filtering quality among the purpose-built open-source options. Weaviate is a solid choice when native hybrid search and built-in vectorization matter. Pinecone makes sense when zero operational burden is worth the cost premium and lock-in.
Chroma is a prototyping tool. Using it in production beyond small-scale deployments creates scaling problems that purpose-built databases solved years ago.
Benchmark on your own data with your own filters. Public benchmarks test conditions that probably don’t match your workload.
Further Reading
- FAISS Wiki — Guidelines for Index Choice — Meta’s decision tree for selecting FAISS index types based on dataset size and constraints
- Qdrant Documentation — Quantization — Detailed guide on scalar, product, and binary quantization configuration in Qdrant
- pgvector GitHub Repository — Source, HNSW/IVFFlat documentation, and performance tuning notes for the PostgreSQL extension
- ANN Benchmarks — Standardized performance comparisons across ANN algorithms and implementations
- Malkov & Yashunin, “Efficient and Robust Approximate Nearest Neighbor using Hierarchical Navigable Small World Graphs” (2018) — The original HNSW paper, essential for understanding the algorithm’s design
- Jégou et al., “Product Quantization for Nearest Neighbor Search” (2011) — The foundational paper on PQ, still the basis for most compressed vector search implementations
- Weaviate Documentation — Hybrid Search — Implementation details for BM25 + vector fusion in Weaviate
- Pinecone Learning Center — Understanding Indexes — Pinecone’s documentation on serverless vs. pod-based index architectures
- VectorDBBench — Open-source benchmarking tool from Zilliz that tests vector databases with configurable workloads and filtering