How Vector Databases Actually Work
Vector databases power every retrieval-augmented AI system. Here’s how HNSW, IVF, and product quantization actually work, with tradeoffs across Pinecone, Weaviate, Qdrant, pgvector, and Chroma.
Every RAG pipeline, semantic search system, and recommendation engine built on LLMs depends on a vector database. The core operation is deceptively simple: store high-dimensional vectors, retrieve the ones closest to a query vector. The engineering behind doing that at scale — billions of vectors, sub-100ms latency, filtered by metadata — is where things get interesting and where choices start to matter.
The vector database market has consolidated somewhat since the 2023-2024 gold rush, but the surviving options have meaningfully different architectures, tradeoff profiles, and failure modes. Picking the right one requires understanding what’s happening under the hood.
Table of Contents
- The Core Problem: Nearest Neighbor Search
- Distance Metrics
- HNSW: The Dominant Index
- IVF: Inverted File Indexes
- Product Quantization: Compressing Vectors
- Combining Techniques: IVF-PQ, HNSW-PQ, and Beyond
- Filtered Search: Where Things Get Hard
- Storage Architecture: In-Memory vs Disk vs Hybrid
- The Database Comparison
- Pinecone
- Weaviate
- Qdrant
- pgvector
- Chroma
- Benchmarks and Real-World Performance
- Choosing the Right Database
- Summary
- Further Reading
The Core Problem: Nearest Neighbor Search
Given a query vector q of dimension d and a dataset of n vectors, find the k vectors closest to q. Exact brute-force search computes n distance calculations, each costing O(d) operations. For 10 million 1536-dimensional vectors (a modest OpenAI embedding collection), that’s ~15 billion floating-point operations per query.
Brute force works for small datasets. Under ~50,000 vectors, a well-optimized linear scan with SIMD instructions finishes in single-digit milliseconds. Past that threshold, approximate nearest neighbor (ANN) algorithms become necessary. They trade a small amount of recall — the fraction of true nearest neighbors actually returned — for orders-of-magnitude speedup.
Brute-force search guarantees perfect recall but doesn’t scale. ANN indexes sacrifice a few percentage points of recall for dramatic speedups.
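For context, this is what the exact baseline looks like — a minimal numpy sketch of brute-force top-k search (function and variable names here are illustrative, not from any particular library):
# Exact brute-force top-k over normalized vectors — illustrative sketch
import numpy as np

def brute_force_topk(query: np.ndarray, vectors: np.ndarray, k: int = 10):
    # vectors: (n, d), query: (d,), both assumed L2-normalized,
    # so the dot product equals cosine similarity.
    scores = vectors @ query                   # n dot products, O(n * d)
    idx = np.argpartition(-scores, k)[:k]      # partial selection, O(n), instead of a full sort
    return idx[np.argsort(-scores[idx])]       # sort only the k winners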
Distance Metrics
The choice of distance metric affects both correctness and performance.
| Metric | Formula | Range | Use Case |
|---|---|---|---|
| Cosine similarity | dot(a,b) / (‖a‖·‖b‖) | [-1, 1] | Text embeddings (most common) |
| Euclidean (L2) | √Σ(aᵢ - bᵢ)² | [0, ∞) | Image embeddings, spatial data |
| Dot product | Σ(aᵢ · bᵢ) | (-∞, ∞) | Pre-normalized vectors, MaxSim |
| Manhattan (L1) | Σ\|aᵢ - bᵢ\| | [0, ∞) | Rarely used for learned embeddings |
Cosine similarity and dot product are equivalent when vectors are L2-normalized, which most text embedding models produce by default (OpenAI, Cohere, Voyage all output normalized vectors). Most databases store normalized vectors and use dot product internally since it’s cheaper — no division by norms.
The metric must match the embedding model’s training objective. Using L2 distance on vectors trained with cosine similarity still works (the ranking is preserved for normalized vectors) but mixing cosine-trained embeddings with dot product on unnormalized vectors will produce garbage results.
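A quick numpy check illustrates why databases can substitute the cheaper dot product once vectors are normalized (an illustrative sketch, not tied to any particular database):
# Cosine vs dot product on normalized vectors — illustrative check
import numpy as np

a, b = np.random.randn(1536), np.random.randn(1536)
a /= np.linalg.norm(a)   # L2-normalize, as most text embedding models already do
b /= np.linalg.norm(b)

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
dot = a @ b
assert np.isclose(cosine, dot)   # identical once norms are 1, so the division can be skipped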
HNSW: The Dominant Index
Hierarchical Navigable Small World (HNSW), published by Malkov and Yashunin in 2018, is the default index type in nearly every vector database. Qdrant, Weaviate, pgvector (via the hnsw access method with vector_cosine_ops), Pinecone, and Chroma all use it as their primary or sole index structure.
How It Works
HNSW builds a multi-layer graph. Each layer is a navigable small world graph where nodes are vectors and edges connect nearby vectors. The top layer is the sparsest — very few vectors, connected by long-range links. Each successive layer adds more vectors with shorter-range connections. The bottom layer contains all vectors.
Search starts at the top layer, greedily navigating to the node closest to the query. It then drops to the next layer, using that node as an entry point, and repeats. At the bottom layer, it performs a more thorough beam search.
HNSW search traverses layers top-down. Upper layers provide coarse navigation; the bottom layer provides precise neighborhood exploration.
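The following is a deliberately simplified Python sketch of that descent. It assumes a pre-built layered adjacency structure (the layers, entry_point, and helper names are illustrative); real implementations maintain priority queues, cache distances, and bound the visited set far more carefully:
# Simplified HNSW-style search — a sketch of the layer descent, not a full implementation
import numpy as np

def hnsw_search(query, vectors, layers, entry_point, ef_search=50, k=10):
    # layers: list of adjacency dicts; layers[0] is the bottom layer (all nodes),
    # layers[-1] the sparse top layer. vectors: (n, d) array of stored vectors.
    def dist(i):
        return np.linalg.norm(vectors[i] - query)

    current = entry_point
    # Greedy descent through the upper layers: hop to a closer neighbor until stuck.
    for layer in reversed(layers[1:]):
        improved = True
        while improved:
            improved = False
            for neighbor in layer.get(current, []):
                if dist(neighbor) < dist(current):
                    current, improved = neighbor, True
    # Bottom layer: beam search of width ef_search (greatly simplified here).
    candidates, visited = {current}, {current}
    while True:
        frontier = {n for c in candidates for n in layers[0].get(c, []) if n not in visited}
        if not frontier:
            break
        visited |= frontier
        candidates = set(sorted(candidates | frontier, key=dist)[:ef_search])
    return sorted(candidates, key=dist)[:k]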
The Key Parameters
M — the number of bidirectional links per node. Higher M means better recall but more memory. Default is typically 16. Each node stores 2·M links at layer 0 and M links at higher layers.
ef_construction — the beam width during index building. Higher values produce a better-connected graph at the cost of slower indexing. Default is usually 128-200. This is a build-time parameter; once the index is built, it’s fixed.
ef_search — the beam width during query. This is the primary recall/speed knob at query time. Higher ef_search means more nodes explored, better recall, slower queries. The relationship is roughly: ef_search=50 gives ~95% recall, ef_search=200 gives ~99% recall, ef_search=500 gets close to 99.9%.
# Qdrant HNSW configuration example
from qdrant_client.models import HnswConfigDiff

hnsw_config = HnswConfigDiff(
    m=16,                        # links per node
    ef_construct=200,            # build-time beam width
    full_scan_threshold=10000,   # brute-force below this count
)
Memory Cost
HNSW’s main drawback: memory consumption. Each vector occupies d × 4 bytes (float32), plus the graph structure adds roughly M × 2 × 8 bytes per vector (for the adjacency list with 64-bit pointers). For 10M vectors at 1536 dimensions with M=16:
- Vectors: 10M × 1536 × 4 = ~57 GB
- Graph: 10M × 16 × 2 × 8 = ~2.4 GB
- Total: ~60 GB
This is the fundamental tension. HNSW wants everything in memory. Solutions include quantization (discussed below), memory-mapped files, and tiered storage.
Time Complexity
Index building: O(n · log(n) · M · ef_construction). For 10M vectors, expect 20-60 minutes depending on hardware and parameters.
Query: O(log(n) · ef_search · d). In practice, 1-10ms per query at 95%+ recall for collections under 100M vectors.
IVF: Inverted File Indexes
IVF (Inverted File Index) takes a different approach: partition the vector space into clusters using k-means, then at query time only search the clusters closest to the query.
How It Works
- Training phase: Run k-means clustering on the dataset (or a representative sample) to produce nlist centroids.
- Assignment: Each vector is assigned to its nearest centroid and stored in that centroid’s posting list.
- Query: Compute distances from the query to all centroids, select the nprobe closest centroids, then scan only those posting lists.
IVF partitions the vector space into clusters. Queries only scan a subset of clusters, controlled by the nprobe parameter.
Parameters
nlist — number of clusters. Rule of thumb: sqrt(n) to 4·sqrt(n). For 10M vectors: 3162 to 12649. More clusters means each posting list is shorter (faster scan) but more likely to miss relevant vectors in adjacent clusters.
nprobe — clusters scanned at query time. Higher nprobe = better recall, slower queries. nprobe=1 gives ~40-60% recall. nprobe=10 gives ~85-95%. nprobe=sqrt(nlist) is a reasonable starting point.
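In Faiss (Meta's similarity search library, covered in Further Reading), an IVF index with these two parameters looks roughly like this — a scaled-down sketch with random data standing in for real embeddings:
# IVF with Faiss — illustrative configuration at a scaled-down size
import faiss
import numpy as np

d, n = 768, 100_000
xb = np.random.rand(n, d).astype("float32")   # stand-in for real embeddings

nlist = 1024                                  # clusters; rule of thumb is sqrt(n) to 4*sqrt(n)
quantizer = faiss.IndexFlatL2(d)              # coarse quantizer that holds the centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)

index.train(xb)                               # k-means training phase
index.add(xb)                                 # each vector goes to its nearest centroid's posting list
index.nprobe = 16                             # clusters scanned per query — the recall/speed knob
distances, ids = index.search(xb[:5], 10)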
IVF vs HNSW Tradeoffs
| Property | HNSW | IVF |
|---|---|---|
| Memory | High (graph + vectors) | Lower (vectors + centroids) |
| Build time | Slow | Moderate (k-means training) |
| Query latency at 95% recall | 1-5ms | 2-10ms |
| Supports on-disk storage | With mmap, not ideal | Natural fit |
| Update/insert cost | Moderate | Cheap (assign to cluster) |
| Scaling past 1B vectors | Needs careful tuning | More natural with sharding |
HNSW generally wins on query quality and latency. IVF wins on memory efficiency and disk-friendliness. In practice, IVF is most often used in combination with quantization (IVF-PQ) for very large collections where HNSW’s memory requirements become prohibitive.
Product Quantization: Compressing Vectors
Product quantization (PQ) compresses vectors from 32-bit floats to compact codes, typically reducing memory by 8-32x. It’s the single most important technique for scaling vector search past the point where raw vectors fit in memory.
How It Works
- Split the d-dimensional vector into m subvectors of dimension d/m. For a 1536-d vector with m=96, each subvector has 16 dimensions.
- Train a codebook of k centroids (typically k=256) for each subvector using k-means. 256 centroids means each subvector maps to 1 byte (8 bits).
- Encode each vector as m bytes — one centroid ID per subvector. A 1536-d float32 vector (6144 bytes) becomes 96 bytes. That’s 64x compression.
- Search uses precomputed distance tables: for each query subvector, compute distances to all 256 centroids, then approximate full distances by summing subvector distances from the table.
Product quantization splits vectors into subvectors, maps each to a centroid ID, and stores only the IDs. Distance computation uses precomputed lookup tables.
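The table-lookup step can be sketched in a few lines of numpy. This assumes codebooks have already been trained with k-means and the vectors already encoded; the random arrays below are stand-ins for real codebooks and codes:
# Product quantization ADC lookup — minimal sketch with pre-trained codebooks assumed
import numpy as np

d, m, k = 1536, 96, 256
sub = d // m                                                         # 16 dimensions per subvector

codebooks = np.random.rand(m, k, sub).astype("float32")              # stand-in for k-means codebooks
codes = np.random.randint(0, k, size=(100_000, m), dtype=np.uint8)   # 96 bytes per stored vector

def adc_distances(query: np.ndarray) -> np.ndarray:
    q = query.reshape(m, sub)
    # One (m, k) table of squared distances from each query subvector to every centroid.
    tables = ((codebooks - q[:, None, :]) ** 2).sum(axis=-1)
    # Approximate each database distance by summing m table lookups — no per-vector float math.
    return tables[np.arange(m), codes].sum(axis=1)

dists = adc_distances(np.random.rand(d).astype("float32"))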
Quantization Quality
PQ introduces quantization error — the distance between the original vector and the reconstructed vector (centroid values substituted back). More subvectors (m) and more centroids (k) reduce error but increase code size and codebook training time.
Typical configurations:
| Dimensions | m (subvectors) | k (centroids) | Code size | Compression | Recall@10 impact |
|---|---|---|---|---|---|
| 1536 | 96 | 256 | 96 bytes | 64x | −3 to −5% |
| 1536 | 192 | 256 | 192 bytes | 32x | −1 to −2% |
| 768 | 96 | 256 | 96 bytes | 32x | −2 to −4% |
| 384 | 48 | 256 | 48 bytes | 32x | −2 to −3% |
A common pattern: use PQ for the initial candidate retrieval, then re-rank the top candidates using exact distances against the original vectors. Combined with asymmetric distance computation (ADC) — comparing the uncompressed query against compressed database vectors rather than quantizing the query too — this retrieve-then-rescore approach gets near-lossless recall with PQ-level memory consumption.
Scalar Quantization
Simpler than PQ but less aggressive: map each float32 dimension to int8 (or int4). A 1536-d vector goes from 6144 bytes to 1536 bytes (int8) — 4x compression. Less lossy than PQ, less compact.
Qdrant and Weaviate both support scalar quantization as a lighter-weight alternative to PQ. For many workloads, int8 scalar quantization with rescoring gives near-identical recall to float32 at 4x memory savings.
# Qdrant scalar quantization configuration
from qdrant_client.models import ScalarQuantization, ScalarQuantizationConfig, ScalarType

scalar_config = ScalarQuantization(
    scalar=ScalarQuantizationConfig(
        type=ScalarType.INT8,
        quantile=0.99,     # clip outliers
        always_ram=True,   # keep quantized vectors in RAM
    )
)
Binary Quantization
The most extreme option: each dimension becomes 1 bit. A 1536-d vector compresses to 192 bytes (32x compression from float32). Distance computation uses POPCNT (population count) CPU instructions on XOR’d bit vectors — extremely fast.
Binary quantization works surprisingly well for high-dimensional embeddings (1024+ dimensions) from models like OpenAI’s text-embedding-3-large or Cohere’s embed-v4. Recall drops maybe 5-10% but can be recovered with rescoring. For lower-dimensional embeddings (384-d), the information loss is usually too high.
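A minimal sketch of the binarize-then-Hamming pipeline, using numpy's bit packing in place of raw POPCNT instructions (names and the threshold-at-zero choice are illustrative assumptions):
# Binary quantization + Hamming distance — illustrative sketch
import numpy as np

def binarize(vectors: np.ndarray) -> np.ndarray:
    # One bit per dimension: positive -> 1, non-positive -> 0, packed 8 dims per byte.
    return np.packbits(vectors > 0, axis=-1)   # (n, d/8) uint8: 192 bytes for d=1536

def hamming(query_bits: np.ndarray, db_bits: np.ndarray) -> np.ndarray:
    # XOR then popcount; real engines do this with POPCNT/AVX instructions.
    return np.unpackbits(np.bitwise_xor(db_bits, query_bits), axis=-1).sum(axis=-1)

db = np.random.randn(100_000, 1536).astype("float32")
db_bits = binarize(db)
query_bits = binarize(np.random.randn(1536).astype("float32"))
dists = hamming(query_bits, db_bits)           # candidates for full-precision rescoring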
Combining Techniques: IVF-PQ, HNSW-PQ, and Beyond
The real power comes from composition.
IVF-PQ
The classic Faiss combination. IVF partitions the space into clusters; PQ compresses the residual vectors (vector minus cluster centroid) within each cluster. Residual PQ is more accurate than raw PQ because the residuals have smaller magnitude and are more uniformly distributed.
IVF-PQ is how Meta runs similarity search on billion-scale datasets. 1B vectors at 768-d with IVF65536-PQ96 fits in ~100 GB RAM. Brute force would require ~3 TB.
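A scaled-down Faiss sketch of the same composition (far smaller nlist and dataset than a billion-scale deployment; random data stands in for real embeddings):
# IVF-PQ with Faiss — scaled-down illustrative parameters
import faiss
import numpy as np

d, n = 768, 200_000
xb = np.random.rand(n, d).astype("float32")    # stand-in for real embeddings

nlist, m, nbits = 4096, 96, 8                  # 4096 coarse clusters; 96 one-byte PQ codes per vector
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

index.train(xb)                                # trains coarse centroids and residual PQ codebooks
index.add(xb)                                  # stores each vector as a cluster id plus a 96-byte code
index.nprobe = 32
distances, ids = index.search(xb[:5], 10)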
HNSW + Quantization
Qdrant and Weaviate use this approach: build the HNSW graph over quantized vectors (scalar or PQ), traverse the graph using cheap quantized distance computations, then rescore the final candidates against original vectors stored on disk.
Two-phase search: fast quantized traversal for candidate generation, then full-precision rescoring for final ranking.
DiskANN
Microsoft’s DiskANN (Vamana graph) is worth mentioning: it builds an HNSW-like graph but is specifically designed for SSD-resident data. The graph structure is optimized for sequential disk reads by organizing nodes to minimize random I/O. Qdrant has incorporated some DiskANN ideas into its on-disk index implementation.
Filtered Search: Where Things Get Hard
Real applications don’t just search vectors. They search vectors with metadata filters: “find similar documents where category=‘legal’ and created_after=‘2025-01-01’.” This is where vector databases diverge most in architecture and where performance can degrade catastrophically.
The Pre-filter vs Post-filter Problem
Post-filter: Run the ANN search, get top-k candidates, then apply metadata filters. Problem: if only 1% of vectors match the filter, you need to retrieve 100x more candidates to get k results. At extreme selectivity, this degenerates to brute force.
Pre-filter: Build a candidate set from the metadata index, then run vector search within that set. Problem: the ANN index was built over the full dataset. You can’t easily restrict an HNSW traversal to a subset of nodes — the graph structure assumes all nodes are reachable.
Integrated filter: Build the ANN index to be filter-aware. This is the hard, correct approach.
Three approaches to filtered vector search, each with distinct failure modes.
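To make the post-filter failure mode concrete, a back-of-envelope sketch with hypothetical selectivities:
# Post-filter degradation — ANN candidates needed before filtering (illustrative)
def postfilter_candidates(k: int, selectivity: float) -> int:
    # If a fraction `selectivity` of vectors passes the metadata filter, you must
    # over-fetch roughly k / selectivity candidates to expect k survivors.
    return int(k / selectivity)

print(postfilter_candidates(10, 0.10))    # 10% filter  -> ~100 candidates
print(postfilter_candidates(10, 0.01))    # 1% filter   -> ~1,000 candidates
print(postfilter_candidates(10, 0.001))   # 0.1% filter -> ~10,000 candidates, approaching a full scan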
How Each Database Handles Filtering
Qdrant uses a payload index (similar to a traditional database index on metadata fields) combined with a filterable HNSW. During graph traversal, it checks filter conditions on each visited node and skips non-matching nodes. When the filter is very selective (< 1% match), it falls back to a pre-filtered set with brute-force search. This adaptive approach handles edge cases well.
Weaviate uses a roaring bitmap intersection approach. It maintains inverted indexes on metadata properties, computes a bitmap of matching document IDs, then traverses the HNSW graph with the bitmap as a mask. This works well for structured filters but can struggle with complex boolean combinations.
Pinecone keeps its filtering implementation proprietary but handles metadata filters natively at the index level. Based on benchmark results, it probably uses a pre-filter + merge approach.
pgvector inherits PostgreSQL’s query planner. A query with WHERE category = 'legal' ORDER BY embedding <=> query LIMIT 10 can use either the vector index or the btree index first, depending on estimated selectivity. In practice, the planner often makes poor choices for combined vector + metadata queries, and you may need planner nudges such as SET enable_seqscan = off (Postgres has no native index hints) to get the intended plan.
Chroma applies metadata filters after retrieving from the HNSW index (post-filter). For highly selective filters on large collections, this becomes a problem.
Storage Architecture: In-Memory vs Disk vs Hybrid
| Approach | Databases | Latency | Cost | Scale limit |
|---|---|---|---|---|
| Fully in-memory | Chroma, early Qdrant | <1ms | $$$ | RAM-bound |
| Memory-mapped files | Qdrant, Weaviate | 1-5ms | $$ | Disk-bound |
| Disk-native | DiskANN variants | 5-20ms | $ | SSD throughput |
| Tiered (hot/warm/cold) | Pinecone, Weaviate | Variable | $-$$$ | Practically unlimited |
The trend is toward memory-mapped and tiered storage. Keeping 1B float32 vectors at 1536-d in RAM requires ~5.7 TB. Even at cloud memory prices of ~$5/GB/month, that’s $28,500/month just for storage. With scalar quantization to int8, it drops to ~1.4 TB ($7,100/month). With PQ at 96 bytes per vector, ~90 GB ($450/month). Quantization isn’t optional at scale — it’s an economic requirement.
Tiered storage keeps quantized vectors and the graph in memory for fast traversal, with full-precision vectors on SSD for rescoring.
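A rough cost model behind those numbers, assuming the ~$5/GB/month RAM price quoted above (the helper below is illustrative, not from any vendor calculator):
# Back-of-envelope RAM cost model — assumes the ~$5/GB/month figure used above
def monthly_ram_cost(n_vectors: int, bytes_per_vector: int, dollars_per_gb: float = 5.0) -> float:
    gigabytes = n_vectors * bytes_per_vector / 1024**3
    return gigabytes * dollars_per_gb

print(monthly_ram_cost(1_000_000_000, 1536 * 4))   # float32      -> ~$28,600/month
print(monthly_ram_cost(1_000_000_000, 1536))       # int8 scalar  -> ~$7,150/month
print(monthly_ram_cost(1_000_000_000, 96))         # 96-byte PQ   -> ~$450/month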
The Database Comparison
Pinecone
Architecture: Fully managed, proprietary. Runs on AWS/GCP. Serverless tier (launched 2024) bills per query and per GB stored, eliminating idle costs.
Index: Proprietary implementation. Supports cosine, euclidean, and dot product. Metadata filtering is native and generally fast.
Strengths: Zero operational overhead. Serverless pricing model is excellent for bursty workloads. Good default performance without tuning. Namespaces for logical separation within an index.
Weaknesses: No self-hosting option. Vendor lock-in. Limited observability into index internals. No way to tune HNSW parameters. The serverless tier has cold-start latency (50-200ms on first query to a dormant namespace). Pricing gets expensive at sustained high throughput — the pod-based tier is more cost-effective for steady-state workloads.
Pricing (April 2026): Serverless: $0.04/1M read units + $0.33/GB/month storage. Pod-based: s1 pods from ~$70/month per pod.
Best for: Teams that want zero infrastructure management and have variable query patterns.
# Pinecone upsert + query
from pinecone import Pinecone

pc = Pinecone(api_key="...")
index = pc.Index("my-index")

index.upsert(
    vectors=[
        {"id": "doc1", "values": embedding, "metadata": {"category": "legal", "year": 2025}},
    ],
    namespace="contracts",
)

results = index.query(
    vector=query_embedding,
    top_k=10,
    namespace="contracts",
    filter={"category": {"$eq": "legal"}, "year": {"$gte": 2024}},
)
Weaviate
Architecture: Open source (BSD-3), written in Go. Can run self-hosted or via Weaviate Cloud. Supports multi-tenancy natively.
Index: HNSW with configurable parameters. Supports flat, dynamic, and HNSW index types. Dynamic index switches from flat (brute-force) to HNSW automatically when a collection exceeds a threshold.
Unique features: Built-in vectorization modules — can call OpenAI, Cohere, or local models to generate embeddings at ingest time. GraphQL and REST APIs. BM25 + vector hybrid search built in. Generative modules that run LLM calls on search results.
Strengths: Integrated vectorizer pipeline reduces application code. Multi-tenancy is well-implemented for SaaS use cases (each tenant gets isolated data with efficient resource sharing). Hybrid search (BM25 + vector) is a first-class feature, not bolted on. Product quantization and binary quantization supported.
Weaknesses: Go codebase means less community contribution compared to Rust-based alternatives. Memory consumption can be high for large collections without quantization. The GraphQL API is powerful but has a learning curve. Module system adds operational complexity.
Pricing (self-hosted): Free. Weaviate Cloud: starts at ~$25/month for sandbox, production tiers from $100+/month.
# Weaviate hybrid search with filter
import weaviate
import weaviate.classes as wvc

client = weaviate.connect_to_local()
collection = client.collections.get("Document")

results = collection.query.hybrid(
    query="contract termination clause",
    alpha=0.7,  # 0=BM25 only, 1=vector only
    limit=10,
    filters=wvc.query.Filter.by_property("category").equal("legal"),
    return_metadata=wvc.query.MetadataQuery(score=True, explain_score=True),
)
Qdrant
Architecture: Open source (Apache 2.0), written in Rust. Single binary, no external dependencies. gRPC and REST APIs.
Index: HNSW with extensive tuning knobs. Supports scalar (int8), product, and binary quantization. On-disk index mode backed by memory-mapped files.
Unique features: Payload indexes (secondary indexes on metadata) with multiple index types (keyword, integer, float, geo, datetime, full-text). Flexible filtering during HNSW traversal with automatic strategy selection based on filter selectivity. Sparse vector support for hybrid search (store both dense and sparse vectors in the same collection). Named vectors — store multiple vector representations per point.
Strengths: Best-in-class filtering performance. Rust implementation gives excellent memory safety and performance. Configuration is explicit and well-documented. Quantization options are the most flexible of any vector database. The Qdrant team publishes detailed benchmarks (ann-benchmarks integration). Sparse vectors + dense vectors in one collection makes hybrid search clean.
Weaknesses: Operational complexity for multi-node clusters (Raft-based consensus). Documentation, while good, assumes more infrastructure knowledge than Pinecone. No built-in vectorizer pipeline — you manage embedding generation.
Pricing (self-hosted): Free. Qdrant Cloud: from $0.016/GB/month (storage) + compute.
# Qdrant filtered search with quantization rescoring
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Filter, FieldCondition, MatchValue,
    SearchParams, QuantizationSearchParams,
)

client = QdrantClient(url="http://localhost:6333")

# Search with quantization rescoring
results = client.query_points(
    collection_name="documents",
    query=query_embedding,
    query_filter=Filter(
        must=[FieldCondition(key="category", match=MatchValue(value="legal"))]
    ),
    search_params=SearchParams(
        hnsw_ef=128,
        quantization=QuantizationSearchParams(
            rescore=True,       # rescore top candidates with full vectors
            oversampling=2.0,   # fetch 2x candidates before rescoring
        ),
    ),
    limit=10,
)
pgvector
Architecture: PostgreSQL extension. Version 0.8.0 (current as of April 2026). Adds vector column type and ANN index support to Postgres.
Index types: HNSW (hnsw) and IVFFlat (ivfflat). HNSW is the recommended default. IVFFlat requires a separate training step and is less flexible.
Strengths: It’s Postgres. ACID transactions. Joins across vector and relational data. Existing Postgres tooling, monitoring, backups, replication all work. No new infrastructure to manage. For applications already using Postgres, adding vector search is a one-migration operation.
Weaknesses: Performance ceiling. pgvector’s HNSW index builds were single-threaded until version 0.6.0 added parallel builds, and large builds remain slow relative to purpose-built engines. Filtered search relies on the Postgres query planner, which doesn’t always make optimal decisions for combined vector + metadata queries. Quantization options are limited: half-precision (halfvec) and binary vector types exist, but there is no product quantization. Memory management is Postgres’s shared_buffers model, which wasn’t designed for vector workloads. At 10M+ vectors, purpose-built vector databases are 2-5x faster.
Scaling: pgvector tops out around 5-10M vectors per Postgres instance with acceptable latency. Beyond that, you either shard across multiple Postgres instances (complex) or move to a purpose-built system. Managed services like Neon and Supabase offer pgvector with auto-scaling that pushes this ceiling higher.
-- pgvector setup and query
CREATE EXTENSION vector;

CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    category TEXT,
    embedding vector(1536)
);

-- Build HNSW index
CREATE INDEX ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 200);

-- Create a btree index for filtered queries
CREATE INDEX ON documents (category);

-- Combined vector + metadata query
SELECT id, content, embedding <=> $1::vector AS distance
FROM documents
WHERE category = 'legal'
ORDER BY embedding <=> $1::vector
LIMIT 10;
Chroma
Architecture: Open source (Apache 2.0), Python-native. Designed as the “SQLite of vector databases” — embedded mode runs in-process. Also supports client-server mode.
Index: HNSW via hnswlib. Simple configuration.
Strengths: The simplest possible getting-started experience. pip install chromadb, create a collection, add documents. Built-in embedding functions (wraps OpenAI, Cohere, etc.). Excellent for prototyping, notebooks, and small applications. The embedded mode means no network hops — sub-millisecond for small collections.
Weaknesses: Not designed for production scale. No quantization. Metadata filtering is post-filter (performance degrades with selective filters on large collections). Limited index tuning. No built-in replication or sharding. The hosted Chroma Cloud offering is still maturing. Performance falls off sharply past 1M vectors.
Best for: Prototyping, development, small applications (<500K vectors), local-first applications.
# Chroma — minimal setup
import chromadb

client = chromadb.Client()  # in-memory
# client = chromadb.PersistentClient(path="./chroma_data")  # persistent

collection = client.create_collection(
    name="documents",
    metadata={"hnsw:M": 16, "hnsw:construction_ef": 200},
)

collection.add(
    ids=["doc1", "doc2"],
    documents=["contract termination clause...", "payment terms..."],
    metadatas=[{"category": "legal"}, {"category": "finance"}],
)

results = collection.query(
    query_texts=["termination policy"],
    n_results=5,
    where={"category": "legal"},
)
Benchmarks and Real-World Performance
The most reliable public benchmark is ann-benchmarks.com, but it tests raw ANN performance without metadata filtering, persistence, or concurrent queries — all of which matter in production.
Qdrant publishes comparative benchmarks at vector-db-benchmark that include filtering scenarios. Taking their results with appropriate salt (they’re benchmarking their own product), the general patterns are consistent with independent testing:
Unfiltered Search (1M vectors, 768-d, 99% recall)
| Database | QPS (single thread) | QPS (parallel) | p99 latency |
|---|---|---|---|
| Qdrant | ~800 | ~4,500 | 3ms |
| Weaviate | ~600 | ~3,200 | 5ms |
| pgvector | ~300 | ~900 | 12ms |
| Chroma | ~400 | ~1,500 | 8ms |
| Pinecone (serverless) | N/A (managed) | ~2,000 | 15ms |
Numbers are approximate, from multiple benchmark sources as of early 2026. Actual performance depends on hardware, configuration, and data distribution.
Filtered Search (1M vectors, 1% selectivity filter)
This is where the differences are starkest:
| Database | QPS (parallel) | p99 latency | Notes |
|---|---|---|---|
| Qdrant | ~3,800 | 5ms | Adaptive filter strategy |
| Weaviate | ~2,200 | 9ms | Roaring bitmap filtering |
| pgvector | ~150 | 45ms | Query planner often suboptimal |
| Chroma | ~200 | 35ms | Post-filter degrades here |
| Pinecone | ~1,500 | 20ms | Managed, limited tuning |
Filtered search performance diverges as filter selectivity increases. The 1% selectivity threshold is where architectural differences become most visible.
Memory Efficiency (10M vectors, 1536-d)
| Configuration | Memory per vector | Total for 10M |
|---|---|---|
| Float32, no quantization | ~6.4 KB | ~60 GB |
| Scalar int8 + HNSW | ~1.8 KB | ~17 GB |
| PQ (96 subvectors) + HNSW | ~0.35 KB | ~3.3 GB |
| Binary quantization + HNSW | ~0.45 KB | ~4.2 GB |
PQ offers the best compression but requires careful codebook training and introduces more approximation error. Scalar int8 is the safe default — 4x compression with minimal recall impact.
Choosing the Right Database
The decision matrix depends on five factors: scale, operational capacity, existing infrastructure, query patterns, and budget.
Decision flow based on collection size and operational preferences.
Decision Framework
| Scenario | Recommendation | Reason |
|---|---|---|
| Prototype or hackathon | Chroma | Zero config, pip install |
| Already on Postgres, <5M vectors | pgvector | No new infrastructure |
| Production RAG, 1M-50M vectors, have ops capacity | Qdrant | Best performance, most tuning control |
| Production, need hybrid BM25+vector search | Weaviate or Qdrant | Both handle it natively |
| Production, zero ops tolerance | Pinecone | Fully managed |
| Need ACID transactions with vector search | pgvector | It’s Postgres |
| Multi-tenant SaaS | Weaviate | Multi-tenancy is first-class |
| Cost-sensitive, large scale | Qdrant self-hosted | Free + aggressive quantization |
| Sparse+dense hybrid with ColBERT-style multi-vector | Qdrant | Named vectors + sparse vectors |
Migration Paths
Starting with Chroma or pgvector and outgrowing it is common and expected. The migration cost is manageable because the vector data is usually regenerated from source documents anyway (re-embedding with a potentially newer model is often desirable). The real migration cost is in the query API changes and metadata schema differences.
A pragmatic pattern: abstract the vector database behind an interface early. Store the canonical data in your primary database (Postgres, etc.) and treat the vector database as a derived index. This makes migration straightforward and gives you a rebuild path if the index gets corrupted.
# Minimal abstraction layer
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class SearchResult:
    id: str
    score: float
    metadata: dict

class VectorStore(ABC):
    @abstractmethod
    def upsert(self, id: str, vector: list[float], metadata: dict) -> None: ...

    @abstractmethod
    def search(
        self, vector: list[float], top_k: int = 10, filters: dict | None = None
    ) -> list[SearchResult]: ...

    @abstractmethod
    def delete(self, ids: list[str]) -> None: ...
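As an example of what one adapter might look like, here is a hypothetical Qdrant-backed implementation of that interface, built on the qdrant-client calls shown earlier (collection setup is omitted, and the filter handling is deliberately simplistic):
# Hypothetical Qdrant-backed adapter for the interface above
from qdrant_client import QdrantClient
from qdrant_client.models import (
    FieldCondition, Filter, MatchValue, PointIdsList, PointStruct,
)

class QdrantStore(VectorStore):
    def __init__(self, url: str, collection: str):
        self.client = QdrantClient(url=url)
        self.collection = collection

    def upsert(self, id: str, vector: list[float], metadata: dict) -> None:
        # Note: Qdrant point IDs must be unsigned integers or UUID strings.
        self.client.upsert(
            collection_name=self.collection,
            points=[PointStruct(id=id, vector=vector, payload=metadata)],
        )

    def search(
        self, vector: list[float], top_k: int = 10, filters: dict | None = None
    ) -> list[SearchResult]:
        # Translate a flat {field: value} dict into exact-match payload conditions.
        conditions = [
            FieldCondition(key=key, match=MatchValue(value=value))
            for key, value in (filters or {}).items()
        ]
        response = self.client.query_points(
            collection_name=self.collection,
            query=vector,
            query_filter=Filter(must=conditions) if conditions else None,
            limit=top_k,
        )
        return [
            SearchResult(id=str(p.id), score=p.score, metadata=p.payload or {})
            for p in response.points
        ]

    def delete(self, ids: list[str]) -> None:
        self.client.delete(
            collection_name=self.collection,
            points_selector=PointIdsList(points=ids),
        )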
What Matters Less Than People Think
Exact recall numbers (99.1% vs 99.3%): In RAG pipelines, the reranker and the LLM’s ability to synthesize from imperfect context matter more than the last fraction of a percent of retrieval recall.
Raw QPS on benchmarks: Most applications do 10-100 queries per second, not 10,000. The benchmark numbers matter for cost efficiency (fewer replicas needed), but single-digit millisecond differences in p50 latency are irrelevant to end-user experience when the LLM response takes 500ms-2s.
Which HNSW implementation is “best”: They’re all based on the same algorithm. The differences are in the surrounding systems — filtering, quantization, storage tiers, operational tooling.
What Actually Matters
Filtered search performance: This is where production systems live. Almost every real query has metadata constraints. Test with your actual filter patterns and selectivities.
Operational reliability: Can you back it up? Monitor it? Upgrade it without downtime? Handle node failures? Pinecone wins here by pushing all of that to the vendor. Self-hosted Qdrant and Weaviate require real operational investment.
Total cost at your scale: Model the full cost including compute, storage, network, and engineering time. Pinecone’s per-query pricing looks expensive at high throughput but cheap when you account for zero ops. Self-hosted Qdrant is cheap on compute but expensive in engineering hours.
Summary
HNSW is the dominant indexing algorithm across all major vector databases, offering the best recall-latency tradeoff for collections under ~100M vectors. IVF remains relevant for billion-scale collections, especially combined with product quantization. Quantization — scalar (int8), product, or binary — is not optional at scale; it’s an economic necessity that typically costs 1-3% recall.
Filtered search is the most important differentiator between databases in production. Qdrant’s adaptive filtering and Weaviate’s bitmap approach handle high-selectivity filters well; pgvector and Chroma degrade under selective filters.
For most teams building RAG applications: start with pgvector if you’re already on Postgres and have fewer than 5M vectors. Move to Qdrant or Weaviate when you need better filtered search performance, quantization, or scale past what a single Postgres instance handles. Use Pinecone if you genuinely want zero operational responsibility and can accept the pricing and vendor lock-in. Use Chroma only for prototyping.
The vectors themselves matter more than the database. A better embedding model (upgrading from text-embedding-3-small to text-embedding-3-large, or switching to a domain-tuned model) will improve retrieval quality more than any amount of index parameter tuning.
Further Reading
- ann-benchmarks.com — Standardized ANN algorithm benchmarks across multiple datasets and implementations
- Malkov & Yashunin, “Efficient and robust approximate nearest neighbor using Hierarchical Navigable Small World graphs” (2018) — The original HNSW paper
- Qdrant vector-db-benchmark — Open-source benchmark suite comparing vector databases with filtering scenarios
- Jégou et al., “Product Quantization for Nearest Neighbor Search” (2011) — Foundational paper on PQ, still the basis for most implementations
- Faiss wiki — Meta’s similarity search library documentation, covers IVF, PQ, and composite indexes in depth
- pgvector GitHub — PostgreSQL vector extension source and documentation
- Qdrant documentation on quantization — Detailed guide on scalar, product, and binary quantization configuration
- DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node — Microsoft’s disk-based ANN implementation
- Weaviate hybrid search documentation — How BM25 + vector fusion works in Weaviate
- Pinecone learning center — Vendor-written but technically solid explanations of vector search concepts