How Vector Databases Actually Work
Vector databases power every retrieval-augmented AI system. Here’s how HNSW, IVF, and product quantization actually work, with tradeoffs across Pinecone, Weaviate, Qdrant, pgvector, and Chroma.
Every RAG pipeline, semantic search system, and recommendation engine built on LLMs depends on a vector database. The core operation is deceptively simple: store high-dimensional vectors, retrieve the ones closest to a query vector. The engineering behind doing that at scale — billions of vectors, sub-100ms latency, filtered by metadata — is where things get interesting and where choices start to matter.
The vector database market has consolidated somewhat since the 2023-2024 gold rush, but the surviving options have meaningfully different architectures, tradeoff profiles, and failure modes. Picking the right one requires understanding what’s happening under the hood.
Table of Contents
- The Core Problem: Nearest Neighbor Search
- Distance Metrics
- HNSW: The Dominant Index
- IVF: Inverted File Indexes
- Product Quantization: Compressing Vectors
- Combining Techniques: IVF-PQ, HNSW-PQ, and Beyond
- Filtered Search: Where Things Get Hard
- Storage Architecture: In-Memory vs Disk vs Hybrid
- The Database Comparison
- Pinecone
- Weaviate
- Qdrant
- pgvector
- Chroma
- Benchmarks and Real-World Performance
- Choosing the Right Database
- Summary
- Further Reading
The Core Problem: Nearest Neighbor Search
Given a query vector q of dimension d and a dataset of n vectors, find the k vectors closest to q. Exact brute-force search computes n distance calculations, each costing O(d) operations. For 10 million 1536-dimensional vectors (a modest OpenAI embedding collection), that’s ~15 billion floating-point operations per query.
Brute force works for small datasets. Under ~50,000 vectors, a well-optimized linear scan with SIMD instructions finishes in single-digit milliseconds. Past that threshold, approximate nearest neighbor (ANN) algorithms become necessary. They trade a small amount of recall — the fraction of true nearest neighbors actually returned — for orders-of-magnitude speedup.
Brute-force search guarantees perfect recall but doesn’t scale. ANN indexes sacrifice a few percentage points of recall for dramatic speedups.
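For context, this is what the exact baseline looks like — a minimal numpy sketch of brute-force top-k search (function and variable names here are illustrative, not from any particular library):
# Exact brute-force top-k over normalized vectors — illustrative sketch
import numpy as np

def brute_force_topk(query: np.ndarray, vectors: np.ndarray, k: int = 10):
    # vectors: (n, d), query: (d,), both assumed L2-normalized,
    # so the dot product equals cosine similarity.
    scores = vectors @ query                   # n dot products, O(n * d)
    idx = np.argpartition(-scores, k)[:k]      # partial selection, O(n), instead of a full sort
    return idx[np.argsort(-scores[idx])]       # sort only the k winners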
Distance Metrics
The choice of distance metric affects both correctness and performance.
| Metric | Formula | Range | Use Case |
|---|---|---|---|
| Cosine similarity | dot(a,b) / (‖a‖·‖b‖) | [-1, 1] | Text embeddings (most common) |
| Euclidean (L2) | √Σ(aᵢ - bᵢ)² | [0, ∞) | Image embeddings, spatial data |
| Dot product | Σ(aᵢ · bᵢ) | (-∞, ∞) | Pre-normalized vectors, MaxSim |
| Manhattan (L1) | Σ\|aᵢ - bᵢ\| | [0, ∞) | Rarely used for learned embeddings |
Cosine similarity and dot product are equivalent when vectors are L2-normalized, which most text embedding models produce by default (OpenAI, Cohere, Voyage all output normalized vectors). Most databases store normalized vectors and use dot product internally since it’s cheaper — no division by norms.
The metric must match the embedding model’s training objective. Using L2 distance on vectors trained with cosine similarity still works (the ranking is preserved for normalized vectors) but mixing cosine-trained embeddings with dot product on unnormalized vectors will produce garbage results.
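A quick numpy check illustrates why databases can substitute the cheaper dot product once vectors are normalized (an illustrative sketch, not tied to any particular database):
# Cosine vs dot product on normalized vectors — illustrative check
import numpy as np

a, b = np.random.randn(1536), np.random.randn(1536)
a /= np.linalg.norm(a)   # L2-normalize, as most text embedding models already do
b /= np.linalg.norm(b)

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
dot = a @ b
assert np.isclose(cosine, dot)   # identical once norms are 1, so the division can be skipped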
HNSW: The Dominant Index
Hierarchical Navigable Small World (HNSW), published by Malkov and Yashunin in 2018, is the default index type in nearly every vector database. Qdrant, Weaviate, pgvector (via the hnsw access method with vector_cosine_ops), Pinecone, and Chroma all use it as their primary or sole index structure.
How It Works
HNSW builds a multi-layer graph. Each layer is a navigable small world graph where nodes are vectors and edges connect nearby vectors. The top layer is the sparsest — very few vectors, connected by long-range links. Each successive layer adds more vectors with shorter-range connections. The bottom layer contains all vectors.
Search starts at the top layer, greedily navigating to the node closest to the query. It then drops to the next layer, using that node as an entry point, and repeats. At the bottom layer, it performs a more thorough beam search.
HNSW search traverses layers top-down. Upper layers provide coarse navigation; the bottom layer provides precise neighborhood exploration.
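The following is a deliberately simplified Python sketch of that descent. It assumes a pre-built layered adjacency structure (the layers, entry_point, and helper names are illustrative); real implementations maintain priority queues, cache distances, and bound the visited set far more carefully:
# Simplified HNSW-style search — a sketch of the layer descent, not a full implementation
import numpy as np

def hnsw_search(query, vectors, layers, entry_point, ef_search=50, k=10):
    # layers: list of adjacency dicts; layers[0] is the bottom layer (all nodes),
    # layers[-1] the sparse top layer. vectors: (n, d) array of stored vectors.
    def dist(i):
        return np.linalg.norm(vectors[i] - query)

    current = entry_point
    # Greedy descent through the upper layers: hop to a closer neighbor until stuck.
    for layer in reversed(layers[1:]):
        improved = True
        while improved:
            improved = False
            for neighbor in layer.get(current, []):
                if dist(neighbor) < dist(current):
                    current, improved = neighbor, True
    # Bottom layer: beam search of width ef_search (greatly simplified here).
    candidates, visited = {current}, {current}
    while True:
        frontier = {n for c in candidates for n in layers[0].get(c, []) if n not in visited}
        if not frontier:
            break
        visited |= frontier
        candidates = set(sorted(candidates | frontier, key=dist)[:ef_search])
    return sorted(candidates, key=dist)[:k]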
The Key Parameters
M — the number of bidirectional links per node. Higher M means better recall but more memory. Default is typically 16. Each node stores 2·M links at layer 0 and M links at higher layers.
ef_construction — the beam width during index building. Higher values produce a better-connected graph at the cost of slower indexing. Default is usually 128-200. This is a build-time parameter; once the index is built, it’s fixed.
ef_search — the beam width during query. This is the primary recall/speed knob at query time. Higher ef_search means more nodes explored, better recall, slower queries. The relationship is roughly: ef_search=50 gives ~95% recall, ef_search=200 gives ~99% recall, ef_search=500 gets close to 99.9%.
# Qdrant HNSW configuration example
from qdrant_client.models import HnswConfigDiff

hnsw_config = HnswConfigDiff(
    m=16,                        # links per node
    ef_construct=200,            # build-time beam width
    full_scan_threshold=10000,   # brute-force below this count
)
Memory Cost
HNSW’s main drawback: memory consumption. Each vector occupies d × 4 bytes (float32), plus the graph structure adds roughly M × 2 × 8 bytes per vector (for the adjacency list with 64-bit pointers). For 10M vectors at 1536 dimensions with M=16:
- Vectors: 10M × 1536 × 4 = ~57 GB
- Graph: 10M × 16 × 2 × 8 = ~2.4 GB
- Total: ~60 GB
This is the fundamental tension. HNSW wants everything in memory. Solutions include quantization (discussed below), memory-mapped files, and tiered storage.
Time Complexity
Index building: O(n · log(n) · M · ef_construction). For 10M vectors, expect 20-60 minutes depending on hardware and parameters.
Query: O(log(n) · ef_search · d). In practice, 1-10ms per query at 95%+ recall for collections under 100M vectors.
IVF: Inverted File Indexes
IVF (Inverted File Index) takes a different approach: partition the vector space into clusters using k-means, then at query time only search the clusters closest to the query.
How It Works
- Training phase: Run k-means clustering on the dataset (or a representative sample) to produce nlist centroids.
- Assignment: Each vector is assigned to its nearest centroid and stored in that centroid’s posting list.
- Query: Compute distances from the query to all centroids, select the nprobe closest centroids, then scan only those posting lists.
IVF partitions the vector space into clusters. Queries only scan a subset of clusters, controlled by the nprobe parameter.
Parameters
nlist — number of clusters. Rule of thumb: sqrt(n) to 4·sqrt(n). For 10M vectors: 3162 to 12649. More clusters means each posting list is shorter (faster scan) but more likely to miss relevant vectors in adjacent clusters.
nprobe — clusters scanned at query time. Higher nprobe = better recall, slower queries. nprobe=1 gives ~40-60% recall. nprobe=10 gives ~85-95%. nprobe=sqrt(nlist) is a reasonable starting point.
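In Faiss (Meta's similarity search library, covered in Further Reading), an IVF index with these two parameters looks roughly like this — a scaled-down sketch with random data standing in for real embeddings:
# IVF with Faiss — illustrative configuration at a scaled-down size
import faiss
import numpy as np

d, n = 768, 100_000
xb = np.random.rand(n, d).astype("float32")   # stand-in for real embeddings

nlist = 1024                                  # clusters; rule of thumb is sqrt(n) to 4*sqrt(n)
quantizer = faiss.IndexFlatL2(d)              # coarse quantizer that holds the centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)

index.train(xb)                               # k-means training phase
index.add(xb)                                 # each vector goes to its nearest centroid's posting list
index.nprobe = 16                             # clusters scanned per query — the recall/speed knob
distances, ids = index.search(xb[:5], 10)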
IVF vs HNSW Tradeoffs
| Property | HNSW | IVF |
|---|---|---|
| Memory | High (graph + vectors) | Lower (vectors + centroids) |
| Build time | Slow | Moderate (k-means training) |
| Query latency at 95% recall | 1-5ms | 2-10ms |
| Supports on-disk storage | With mmap, not ideal | Natural fit |
| Update/insert cost | Moderate | Cheap (assign to cluster) |
| Scaling past 1B vectors | Needs careful tuning | More natural with sharding |
HNSW generally wins on query quality and latency. IVF wins on memory efficiency and disk-friendliness. In practice, IVF is most often used in combination with quantization (IVF-PQ) for very large collections where HNSW’s memory requirements become prohibitive.
Product Quantization: Compressing Vectors
Product quantization (PQ) compresses vectors from 32-bit floats to compact codes, typically reducing memory by 8-32x. It’s the single most important technique for scaling vector search past the point where raw vectors fit in memory.
How It Works
- Split the d-dimensional vector into m subvectors of dimension d/m. For a 1536-d vector with m=96, each subvector has 16 dimensions.
- Train a codebook of k centroids (typically k=256) for each subvector using k-means. 256 centroids means each subvector maps to 1 byte (8 bits).
- Encode each vector as m bytes — one centroid ID per subvector. A 1536-d float32 vector (6144 bytes) becomes 96 bytes. That’s 64x compression.
- Search uses precomputed distance tables: for each query subvector, compute distances to all 256 centroids, then approximate full distances by summing subvector distances from the table.
Product quantization splits vectors into subvectors, maps each to a centroid ID, and stores only the IDs. Distance computation uses precomputed lookup tables.
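The table-lookup step can be sketched in a few lines of numpy. This assumes codebooks have already been trained with k-means and the vectors already encoded; the random arrays below are stand-ins for real codebooks and codes:
# Product quantization ADC lookup — minimal sketch with pre-trained codebooks assumed
import numpy as np

d, m, k = 1536, 96, 256
sub = d // m                                                         # 16 dimensions per subvector

codebooks = np.random.rand(m, k, sub).astype("float32")              # stand-in for k-means codebooks
codes = np.random.randint(0, k, size=(100_000, m), dtype=np.uint8)   # 96 bytes per stored vector

def adc_distances(query: np.ndarray) -> np.ndarray:
    q = query.reshape(m, sub)
    # One (m, k) table of squared distances from each query subvector to every centroid.
    tables = ((codebooks - q[:, None, :]) ** 2).sum(axis=-1)
    # Approximate each database distance by summing m table lookups — no per-vector float math.
    return tables[np.arange(m), codes].sum(axis=1)

dists = adc_distances(np.random.rand(d).astype("float32"))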
Quantization Quality
PQ introduces quantization error — the distance between the original vector and the reconstructed vector (centroid values substituted back). More subvectors (m) and more centroids (k) reduce error but increase code size and codebook training time.
Typical configurations:
| Dimensions | m (subvectors) | k (centroids) | Code size | Compression | Recall@10 impact |
|---|---|---|---|---|---|
| 1536 | 96 | 256 | 96 bytes | 64x | −3 to −5% |
| 1536 | 192 | 256 | 192 bytes | 32x | −1 to −2% |
| 768 | 96 | 256 | 96 bytes | 32x | −2 to −4% |
| 384 | 48 | 256 | 48 bytes | 32x | −2 to −3% |
A common pattern: use PQ for the initial candidate retrieval, then re-rank the top candidates using exact distances against the original vectors. Combined with asymmetric distance computation (ADC) — comparing the uncompressed query against compressed database vectors rather than quantizing the query too — this retrieve-then-rescore approach gets near-lossless recall with PQ-level memory consumption.
Scalar Quantization
Simpler than PQ but less aggressive: map each float32 dimension to int8 (or int4). A 1536-d vector goes from 6144 bytes to 1536 bytes (int8) — 4x compression. Less lossy than PQ, less compact.
Qdrant and Weaviate both support scalar quantization as a lighter-weight alternative to PQ. For many workloads, int8 scalar quantization with rescoring gives near-identical recall to float32 at 4x memory savings.
# Qdrant scalar quantization configuration
from qdrant_client.models import ScalarQuantization, ScalarQuantizationConfig, ScalarType

scalar_config = ScalarQuantization(
    scalar=ScalarQuantizationConfig(
        type=ScalarType.INT8,
        quantile=0.99,     # clip outliers
        always_ram=True,   # keep quantized vectors in RAM
    )
)
Binary Quantization
The most extreme option: each dimension becomes 1 bit. A 1536-d vector compresses to 192 bytes (32x compression from float32). Distance computation uses POPCNT (population count) CPU instructions on XOR’d bit vectors — extremely fast.
Binary quantization works surprisingly well for high-dimensional embeddings (1024+ dimensions) from models like OpenAI’s text-embedding-3-large or Cohere’s embed-v4. Recall drops maybe 5-10% but can be recovered with rescoring. For lower-dimensional embeddings (384-d), the information loss is usually too high.
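A minimal sketch of the binarize-then-Hamming pipeline, using numpy's bit packing in place of raw POPCNT instructions (names and the threshold-at-zero choice are illustrative assumptions):
# Binary quantization + Hamming distance — illustrative sketch
import numpy as np

def binarize(vectors: np.ndarray) -> np.ndarray:
    # One bit per dimension: positive -> 1, non-positive -> 0, packed 8 dims per byte.
    return np.packbits(vectors > 0, axis=-1)   # (n, d/8) uint8: 192 bytes for d=1536

def hamming(query_bits: np.ndarray, db_bits: np.ndarray) -> np.ndarray:
    # XOR then popcount; real engines do this with POPCNT/AVX instructions.
    return np.unpackbits(np.bitwise_xor(db_bits, query_bits), axis=-1).sum(axis=-1)

db = np.random.randn(100_000, 1536).astype("float32")
db_bits = binarize(db)
query_bits = binarize(np.random.randn(1536).astype("float32"))
dists = hamming(query_bits, db_bits)           # candidates for full-precision rescoring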
Combining Techniques: IVF-PQ, HNSW-PQ, and Beyond
The real power comes from composition.
IVF-PQ
The classic Faiss combination. IVF partitions the space into clusters; PQ compresses the residual vectors (vector minus cluster centroid) within each cluster. Residual PQ is more accurate than raw PQ because the residuals have smaller magnitude and are more uniformly distributed.
IVF-PQ is how Meta runs similarity search on billion-scale datasets. 1B vectors at 768-d with IVF65536-PQ96 fits in ~100 GB RAM. Brute force would require ~3 TB.
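A scaled-down Faiss sketch of the same composition (far smaller nlist and dataset than a billion-scale deployment; random data stands in for real embeddings):
# IVF-PQ with Faiss — scaled-down illustrative parameters
import faiss
import numpy as np

d, n = 768, 200_000
xb = np.random.rand(n, d).astype("float32")    # stand-in for real embeddings

nlist, m, nbits = 4096, 96, 8                  # 4096 coarse clusters; 96 one-byte PQ codes per vector
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

index.train(xb)                                # trains coarse centroids and residual PQ codebooks
index.add(xb)                                  # stores each vector as a cluster id plus a 96-byte code
index.nprobe = 32
distances, ids = index.search(xb[:5], 10)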
HNSW + Quantization
Qdrant and Weaviate use this approach: build the HNSW graph over quantized vectors (scalar or PQ), traverse the graph using cheap quantized distance computations, then rescore the final candidates against original vectors stored on disk.
Two-phase search: fast quantized traversal for candidate generation, then full-precision rescoring for final ranking.
DiskANN
Microsoft’s DiskANN (Vamana graph) is worth mentioning: it builds an HNSW-like graph but is specifically designed for SSD-resident data. The graph structure is optimized for sequential disk reads by organizing nodes to minimize random I/O. Qdrant has incorporated some DiskANN ideas into its on-disk index implementation.
Filtered Search: Where Things Get Hard
Real applications don’t just search vectors. They search vectors with metadata filters: “find similar documents where category=‘legal’ and created_after=‘2025-01-01’.” This is where vector databases diverge most in architecture and where performance can degrade catastrophically.
The Pre-filter vs Post-filter Problem
Post-filter: Run the ANN search, get top-k candidates, then apply metadata filters. Problem: if only 1% of vectors match the filter, you need to retrieve 100x more candidates to get k results. At extreme selectivity, this degenerates to brute force.
Pre-filter: Build a candidate set from the metadata index, then run vector search within that set. Problem: the ANN index was built over the full dataset. You can’t easily restrict an HNSW traversal to a subset of nodes — the graph structure assumes all nodes are reachable.
Integrated filter: Build the ANN index to be filter-aware. This is the hard, correct approach.
Three approaches to filtered vector search, each with distinct failure modes.
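To make the post-filter failure mode concrete, a back-of-envelope sketch with hypothetical selectivities:
# Post-filter degradation — ANN candidates needed before filtering (illustrative)
def postfilter_candidates(k: int, selectivity: float) -> int:
    # If a fraction `selectivity` of vectors passes the metadata filter, you must
    # over-fetch roughly k / selectivity candidates to expect k survivors.
    return int(k / selectivity)

print(postfilter_candidates(10, 0.10))    # 10% filter  -> ~100 candidates
print(postfilter_candidates(10, 0.01))    # 1% filter   -> ~1,000 candidates
print(postfilter_candidates(10, 0.001))   # 0.1% filter -> ~10,000 candidates, approaching a full scan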
How Each Database Handles Filtering
Qdrant uses a payload index (similar to a traditional database index on metadata fields) combined with a filterable HNSW. During graph traversal, it checks filter conditions on each visited node and skips non-matching nodes. When the filter is very selective (< 1% match), it falls back to a pre-filtered set with brute-force search. This adaptive approach handles edge cases well.
Weaviate uses a roaring bitmap intersection approach. It maintains inverted indexes on metadata properties, computes a bitmap of matching document IDs, then traverses the HNSW graph with the bitmap as a mask. This works well for structured filters but can struggle with complex boolean combinations.
Pinecone keeps its filtering implementation proprietary but handles metadata filters natively at the index level. Based on benchmark results, it probably uses a pre-filter + merge approach.
pgvector inherits PostgreSQL’s query planner. A query with WHERE category = 'legal' ORDER BY embedding <=> query LIMIT 10 can use either the vector index or the btree index first, depending on estimated selectivity. In practice, the planner often makes poor choices for combined vector + metadata queries, and you may need planner nudges such as SET enable_seqscan = off (Postgres has no native index hints) to get the intended plan.
Chroma applies metadata filters after retrieving from the HNSW index (post-filter). For highly selective filters on large collections, this becomes a problem.
Storage Architecture: In-Memory vs Disk vs Hybrid
| Approach | Databases | Latency | Cost | Scale limit |
|---|---|---|---|---|
| Fully in-memory | Chroma, early Qdrant | <1ms | $$$ | RAM-bound |
| Memory-mapped files | Qdrant, Weaviate | 1-5ms | $$ | Disk-bound |
| Disk-native | DiskANN variants | 5-20ms | $ | SSD throughput |
| Tiered (hot/warm/cold) | Pinecone, Weaviate | Variable | $-$$$ | Practically unlimited |
The trend is toward memory-mapped and tiered storage. Keeping 1B float32 vectors at 1536-d in RAM requires ~5.7 TB. Even at cloud memory prices of ~$5/GB/month, that’s $28,500/month just for storage. With scalar quantization to int8, it drops to ~1.4 TB ($7,100/month). With PQ at 96 bytes per vector, ~90 GB ($450/month). Quantization isn’t optional at scale — it’s an economic requirement.
Tiered storage keeps quantized vectors and the graph in memory for fast traversal, with full-precision vectors on SSD for rescoring.
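A rough cost model behind those numbers, assuming the ~$5/GB/month RAM price quoted above (the helper below is illustrative, not from any vendor calculator):
# Back-of-envelope RAM cost model — assumes the ~$5/GB/month figure used above
def monthly_ram_cost(n_vectors: int, bytes_per_vector: int, dollars_per_gb: float = 5.0) -> float:
    gigabytes = n_vectors * bytes_per_vector / 1024**3
    return gigabytes * dollars_per_gb

print(monthly_ram_cost(1_000_000_000, 1536 * 4))   # float32      -> ~$28,600/month
print(monthly_ram_cost(1_000_000_000, 1536))       # int8 scalar  -> ~$7,150/month
print(monthly_ram_cost(1_000_000_000, 96))         # 96-byte PQ   -> ~$450/month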
The Database Comparison
Pinecone
Architecture: Fully managed, proprietary. Runs on AWS/GCP. Serverless tier (launched 2024) bills per query and per GB stored, eliminating idle costs.
Index: Proprietary implementation. Supports cosine, euclidean, and dot product. Metadata filtering is native and generally fast.
Strengths: Zero operational overhead. Serverless pricing model is excellent for bursty workloads. Good default performance without tuning. Namespaces for logical separation within an index.
Weaknesses: No self-hosting option. Vendor lock-in. Limited observability into index internals. No way to tune HNSW parameters. The serverless tier has cold-start latency (50-200ms on first query to a dormant namespace). Pricing gets expensive at sustained high throughput — the pod-based tier is more cost-effective for steady-state workloads.
Pricing (April 2026): Serverless: $0.04/1M read units + $0.33/GB/month storage. Pod-based: s1 pods from ~$70/month per pod.
Best for: Teams that want zero infrastructure management and have variable query patterns.
# Pinecone upsert + query
from pinecone import Pinecone

pc = Pinecone(api_key="...")
index = pc.Index("my-index")

index.upsert(
    vectors=[
        {"id": "doc1", "values": embedding, "metadata": {"category": "legal", "year": 2025}},
    ],
    namespace="contracts",
)

results = index.query(
    vector=query_embedding,
    top_k=10,
    namespace="contracts",
    filter={"category": {"$eq": "legal"}, "year": {"$gte": 2024}},
)
Weaviate
Architecture: Open source (BSD-3), written in Go. Can run self-hosted or via Weaviate Cloud. Supports multi-tenancy natively.
Index: HNSW with configurable parameters. Supports flat, dynamic, and HNSW index types. Dynamic index switches from flat (brute-force) to HNSW automatically when a collection exceeds a threshold.
Unique features: Built-in vectorization modules — can call OpenAI, Cohere, or local models to generate embeddings at ingest time. GraphQL and REST APIs. BM25 + vector hybrid search built in. Generative modules that run LLM calls on search results.
Strengths: Integrated vectorizer pipeline reduces application code. Multi-tenancy is well-implemented for SaaS use cases (each tenant gets isolated data with efficient resource sharing). Hybrid search (BM25 + vector) is a first-class feature, not bolted on. Product quantization and binary quantization supported.
Weaknesses: Go codebase means less community contribution compared to Rust-based alternatives. Memory consumption can be high for large collections without quantization. The GraphQL API is powerful but has a learning curve. Module system adds operational complexity.
Pricing (self-hosted): Free. Weaviate Cloud: starts at ~$25/month for sandbox, production tiers from $100+/month.
# Weaviate hybrid search with filter
import weaviate
import weaviate.classes as wvc

client = weaviate.connect_to_local()
collection = client.collections.get("Document")

results = collection.query.hybrid(
    query="contract termination clause",
    alpha=0.7,  # 0=BM25 only, 1=vector only
    limit=10,
    filters=wvc.query.Filter.by_property("category").equal("legal"),
    return_metadata=wvc.query.MetadataQuery(score=True, explain_score=True),
)
Qdrant
Architecture: Open source (Apache 2.0), written in Rust. Single binary, no external dependencies. gRPC and REST APIs.
Index: HNSW with extensive tuning knobs. Supports scalar (int8), product, and binary quantization. On-disk index mode backed by memory-mapped files.
Unique features: Payload indexes (secondary indexes on metadata) with multiple index types (keyword, integer, float, geo, datetime, full-text). Flexible filtering during HNSW traversal with automatic strategy selection based on filter selectivity. Sparse vector support for hybrid search (store both dense and sparse vectors in the same collection). Named vectors — store multiple vector representations per point.
Strengths: Best-in-class filtering performance. Rust implementation gives excellent memory safety and performance. Configuration is explicit and well-documented. Quantization options are the most flexible of any vector database. The Qdrant team publishes detailed benchmarks (ann-benchmarks integration). Sparse vectors + dense vectors in one collection makes hybrid search clean.
Weaknesses: Operational complexity for multi-node clusters (Raft-based consensus). Documentation, while good, assumes more infrastructure knowledge than Pinecone. No built-in vectorizer pipeline — you manage embedding generation.
Pricing (self-hosted): Free. Qdrant Cloud: from $0.016/GB/month (storage) + compute.
# Qdrant filtered search with quantization rescoring
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Filter, FieldCondition, MatchValue,
    SearchParams, QuantizationSearchParams,
)

client = QdrantClient(url="http://localhost:6333")

# Search with quantization rescoring
results = client.query_points(
    collection_name="documents",
    query=query_embedding,
    query_filter=Filter(
        must=[FieldCondition(key="category", match=MatchValue(value="legal"))]
    ),
    search_params=SearchParams(
        hnsw_ef=128,
        quantization=QuantizationSearchParams(
            rescore=True,       # rescore top candidates with full vectors
            oversampling=2.0,   # fetch 2x candidates before rescoring
        ),
    ),
    limit=10,
)
pgvector
Architecture: PostgreSQL extension. Version 0.8.0 (current as of April 2026). Adds vector column type and ANN index support to Postgres.
Index types: HNSW (hnsw) and IVFFlat (ivfflat). HNSW is the recommended default. IVFFlat requires a separate training step and is less flexible.
Strengths: It’s Postgres. ACID transactions. Joins across vector and relational data. Existing Postgres tooling, monitoring, backups, replication all work. No new infrastructure to manage. For applications already using Postgres, adding vector search is a one-migration operation.
Weaknesses: Performance ceiling. pgvector’s HNSW index builds were single-threaded until version 0.6.0 added parallel builds, and large builds remain slow relative to purpose-built engines. Filtered search relies on the Postgres query planner, which doesn’t always make optimal decisions for combined vector + metadata queries. Quantization options are limited: half-precision (halfvec) and binary vector types exist, but there is no product quantization. Memory management is Postgres’s shared_buffers model, which wasn’t designed for vector workloads. At 10M+ vectors, purpose-built vector databases are 2-5x faster.
Scaling: pgvector tops out around 5-10M vectors per Postgres instance with acceptable latency. Beyond that, you either shard across multiple Postgres instances (complex) or move to a purpose-built system. Managed services like Neon and Supabase offer pgvector with auto-scaling that pushes this ceiling higher.
-- pgvector setup and query
CREATE EXTENSION vector;

CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    category TEXT,
    embedding vector(1536)
);

-- Build HNSW index
CREATE INDEX ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 200);

-- Create a btree index for filtered queries
CREATE INDEX ON documents (category);

-- Combined vector + metadata query
SELECT id, content, embedding <=> $1::vector AS distance
FROM documents
WHERE category = 'legal'
ORDER BY embedding <=> $1::vector
LIMIT 10;
Chroma
Architecture: Open source (Apache 2.0), Python-native. Designed as the “SQLite of vector databases” — embedded mode runs in-process. Also supports client-server mode.
Index: HNSW via hnswlib. Simple configuration.
Strengths: The simplest possible getting-started experience. pip install chromadb, create a collection, add documents. Built-in embedding functions (wraps OpenAI, Cohere, etc.). Excellent for prototyping, notebooks, and small applications. The embedded mode means no network hops — sub-millisecond for small collections.
Weaknesses: Not designed for production scale. No quantization. Metadata filtering is post-filter (performance degrades with selective filters on large collections). Limited index tuning. No built-in replication or sharding. The hosted Chroma Cloud offering is still maturing. Performance falls off sharply past 1M vectors.
Best for: Prototyping, development, small applications (<500K vectors), local-first applications.
# Chroma — minimal setup
import chromadb

client = chromadb.Client()  # in-memory
# client = chromadb.PersistentClient(path="./chroma_data")  # persistent

collection = client.create_collection(
    name="documents",
    metadata={"hnsw:M": 16, "hnsw:construction_ef": 200},
)

collection.add(
    ids=["doc1", "doc2"],
    documents=["contract termination clause...", "payment terms..."],
    metadatas=[{"category": "legal"}, {"category": "finance"}],
)

results = collection.query(
    query_texts=["termination policy"],
    n_results=5,
    where={"category": "legal"},
)
Benchmarks and Real-World Performance
The most reliable public benchmark is ann-benchmarks.com, but it tests raw ANN performance without metadata filtering, persistence, or concurrent queries — all of which matter in production.
Qdrant publishes comparative benchmarks at vector-db-benchmark that include filtering scenarios. Taking their results with appropriate salt (they’re benchmarking their own product), the general patterns are consistent with independent testing:
Unfiltered Search (1M vectors, 768-d, 99% recall)
| Database | QPS (single thread) | QPS (parallel) | p99 latency |
|---|---|---|---|
| Qdrant | ~800 | ~4,500 | 3ms |
| Weaviate | ~600 | ~3,200 | 5ms |
| pgvector | ~300 | ~900 | 12ms |
| Chroma | ~400 | ~1,500 | 8ms |
| Pinecone (serverless) | N/A (managed) | ~2,000 | 15ms |
Numbers are approximate, from multiple benchmark sources as of early 2026. Actual performance depends on hardware, configuration, and data distribution.
Filtered Search (1M vectors, 1% selectivity filter)
This is where the differences are starkest:
| Database | QPS (parallel) | p99 latency | Notes |
|---|---|---|---|
| Qdrant | ~3,800 | 5ms | Adaptive filter strategy |
| Weaviate | ~2,200 | 9ms | Roaring bitmap filtering |
| pgvector | ~150 | 45ms | Query planner often suboptimal |
| Chroma | ~200 | 35ms | Post-filter degrades here |
| Pinecone | ~1,500 | 20ms | Managed, limited tuning |
Filtered search performance diverges as filter selectivity increases. The 1% selectivity threshold is where architectural differences become most visible.
Memory Efficiency (10M vectors, 1536-d)
| Configuration | Memory per vector | Total for 10M |
|---|---|---|
| Float32, no quantization | ~6.4 KB | ~60 GB |
| Scalar int8 + HNSW | ~1.8 KB | ~17 GB |
| PQ (96 subvectors) + HNSW | ~0.35 KB | ~3.3 GB |
| Binary quantization + HNSW | ~0.45 KB | ~4.2 GB |
PQ offers the best compression but requires careful codebook training and introduces more approximation error. Scalar int8 is the safe default — 4x compression with minimal recall impact.
Choosing the Right Database
The decision matrix depends on five factors: scale, operational capacity, existing infrastructure, query patterns, and budget.
Decision flow based on collection size and operational preferences.
Decision Framework
| Scenario | Recommendation | Reason |
|---|---|---|
| Prototype or hackathon | Chroma | Zero config, pip install |
| Already on Postgres, <5M vectors | pgvector | No new infrastructure |
| Production RAG, 1M-50M vectors, have ops capacity | Qdrant | Best performance, most tuning control |
| Production, need hybrid BM25+vector search | Weaviate or Qdrant | Both handle it natively |
| Production, zero ops tolerance | Pinecone | Fully managed |
| Need ACID transactions with vector search | pgvector | It’s Postgres |
| Multi-tenant SaaS | Weaviate | Multi-tenancy is first-class |
| Cost-sensitive, large scale | Qdrant self-hosted | Free + aggressive quantization |
| Sparse+dense hybrid with ColBERT-style multi-vector | Qdrant | Named vectors + sparse vectors |
Migration Paths
Starting with Chroma or pgvector and outgrowing it is common and expected. The migration cost is manageable because the vector data is usually regenerated from source documents anyway (re-embedding with a potentially newer model is often desirable). The real migration cost is in the query API changes and metadata schema differences.
A pragmatic pattern: abstract the vector database behind an interface early. Store the canonical data in your primary database (Postgres, etc.) and treat the vector database as a derived index. This makes migration straightforward and gives you a rebuild path if the index gets corrupted.
# Minimal abstraction layer
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class SearchResult:
    id: str
    score: float
    metadata: dict

class VectorStore(ABC):
    @abstractmethod
    def upsert(self, id: str, vector: list[float], metadata: dict) -> None: ...

    @abstractmethod
    def search(
        self, vector: list[float], top_k: int = 10, filters: dict | None = None
    ) -> list[SearchResult]: ...

    @abstractmethod
    def delete(self, ids: list[str]) -> None: ...
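As an example of what one adapter might look like, here is a hypothetical Qdrant-backed implementation of that interface, built on the qdrant-client calls shown earlier (collection setup is omitted, and the filter handling is deliberately simplistic):
# Hypothetical Qdrant-backed adapter for the interface above
from qdrant_client import QdrantClient
from qdrant_client.models import (
    FieldCondition, Filter, MatchValue, PointIdsList, PointStruct,
)

class QdrantStore(VectorStore):
    def __init__(self, url: str, collection: str):
        self.client = QdrantClient(url=url)
        self.collection = collection

    def upsert(self, id: str, vector: list[float], metadata: dict) -> None:
        # Note: Qdrant point IDs must be unsigned integers or UUID strings.
        self.client.upsert(
            collection_name=self.collection,
            points=[PointStruct(id=id, vector=vector, payload=metadata)],
        )

    def search(
        self, vector: list[float], top_k: int = 10, filters: dict | None = None
    ) -> list[SearchResult]:
        # Translate a flat {field: value} dict into exact-match payload conditions.
        conditions = [
            FieldCondition(key=key, match=MatchValue(value=value))
            for key, value in (filters or {}).items()
        ]
        response = self.client.query_points(
            collection_name=self.collection,
            query=vector,
            query_filter=Filter(must=conditions) if conditions else None,
            limit=top_k,
        )
        return [
            SearchResult(id=str(p.id), score=p.score, metadata=p.payload or {})
            for p in response.points
        ]

    def delete(self, ids: list[str]) -> None:
        self.client.delete(
            collection_name=self.collection,
            points_selector=PointIdsList(points=ids),
        )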
What Matters Less Than People Think
Exact recall numbers (99.1% vs 99.3%): In RAG pipelines, the reranker and the LLM’s ability to synthesize from imperfect context matter more than the last fraction of a percent of retrieval recall.
Raw QPS on benchmarks: Most applications do 10-100 queries per second, not 10,000. The benchmark numbers matter for cost efficiency (fewer replicas needed), but single-digit millisecond differences in p50 latency are irrelevant to end-user experience when the LLM response takes 500ms-2s.
Which HNSW implementation is “best”: They’re all based on the same algorithm. The differences are in the surrounding systems — filtering, quantization, storage tiers, operational tooling.
What Actually Matters
Filtered search performance: This is where production systems live. Almost every real query has metadata constraints. Test with your actual filter patterns and selectivities.
Operational reliability: Can you back it up? Monitor it? Upgrade it without downtime? Handle node failures? Pinecone wins here by pushing all of that to the vendor. Self-hosted Qdrant and Weaviate require real operational investment.
Total cost at your scale: Model the full cost including compute, storage, network, and engineering time. Pinecone’s per-query pricing looks expensive at high throughput but cheap when you account for zero ops. Self-hosted Qdrant is cheap on compute but expensive in engineering hours.
Summary
HNSW is the dominant indexing algorithm across all major vector databases, offering the best recall-latency tradeoff for collections under ~100M vectors. IVF remains relevant for billion-scale collections, especially combined with product quantization. Quantization — scalar (int8), product, or binary — is not optional at scale; it’s an economic necessity that typically costs 1-3% recall.
Filtered search is the most important differentiator between databases in production. Qdrant’s adaptive filtering and Weaviate’s bitmap approach handle high-selectivity filters well; pgvector and Chroma degrade under selective filters.
For most teams building RAG applications: start with pgvector if you’re already on Postgres and have fewer than 5M vectors. Move to Qdrant or Weaviate when you need better filtered search performance, quantization, or scale past what a single Postgres instance handles. Use Pinecone if you genuinely want zero operational responsibility and can accept the pricing and vendor lock-in. Use Chroma only for prototyping.
The vectors themselves matter more than the database. A better embedding model (upgrading from text-embedding-3-small to text-embedding-3-large, or switching to a domain-tuned model) will improve retrieval quality more than any amount of index parameter tuning.
Further Reading
- ann-benchmarks.com — Standardized ANN algorithm benchmarks across multiple datasets and implementations
- Malkov & Yashunin, “Efficient and robust approximate nearest neighbor using Hierarchical Navigable Small World graphs” (2018) — The original HNSW paper
- Qdrant vector-db-benchmark — Open-source benchmark suite comparing vector databases with filtering scenarios
- Jégou et al., “Product Quantization for Nearest Neighbor Search” (2011) — Foundational paper on PQ, still the basis for most implementations
- Faiss wiki — Meta’s similarity search library documentation, covers IVF, PQ, and composite indexes in depth
- pgvector GitHub — PostgreSQL vector extension source and documentation
- Qdrant documentation on quantization — Detailed guide on scalar, product, and binary quantization configuration
- DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node — Microsoft’s disk-based ANN implementation
- Weaviate hybrid search documentation — How BM25 + vector fusion works in Weaviate
- Pinecone learning center — Vendor-written but technically solid explanations of vector search concepts