2026-04-07 · Deep Dives · deep-dive, reference, architecture

Vector Databases: A Deep Dive

The post you bookmark. One topic, covered end to end.

Vector databases power every RAG pipeline and semantic search system in production today, but the gap between “spin up a hosted index” and “understand why recall dropped 12% after scaling to 50M vectors” is where most teams get stuck.


Every vector database solves one problem: given a query vector of d dimensions, find the k most similar vectors from a collection of n vectors. Exact nearest neighbor search requires comparing the query against every vector in the dataset — O(n × d) per query. At 1M vectors with 1536 dimensions (OpenAI’s text-embedding-3-small), that’s roughly 3 billion floating-point operations per query. Manageable on a single machine with brute force. At 100M vectors, it’s not.

Approximate Nearest Neighbor (ANN) algorithms trade a small amount of recall for orders-of-magnitude speedup. A well-tuned HNSW index at 100M vectors returns results in under 10ms with 95-99% recall — meaning 95-99% of the true nearest neighbors appear in the result set.

[Diagram: brute force (O(n × d) per query, 100% recall, exact but slow) versus an ANN index (O(log n) to O(√n), 95-99% recall, approximate but fast), both producing top-k results.]

Brute force guarantees perfect recall but scales linearly; ANN indexes sacrifice a small amount of accuracy for sublinear query times.

The entire field of vector databases is about the data structures and algorithms that make ANN practical at scale, plus the operational concerns — filtering, updates, persistence, replication — that turn an index into a database.

Distance Metrics: Choosing the Right Similarity Function

The choice of distance metric affects both correctness and index construction. The three that matter:

Cosine similarity measures the angle between vectors, ignoring magnitude. Most embedding models (OpenAI, Cohere, Voyage, BGE) normalize their outputs to unit length, which means cosine similarity reduces to a dot product. If your vectors are already L2-normalized, cosine and dot product are equivalent — use dot product, it’s one fewer operation.

Euclidean distance (L2) measures straight-line distance. For normalized vectors, minimizing L2 distance produces the same ranking as maximizing cosine similarity: ||a - b||² = 2 - 2·cos(a, b) when ||a|| = ||b|| = 1. For non-normalized vectors (some custom models, learned representations from autoencoders), L2 and cosine diverge.

Inner product (dot product) is equivalent to cosine when vectors are normalized. When they’re not, it weights both direction and magnitude — useful when magnitude carries information (e.g., a popularity-weighted embedding where more important items have longer vectors).
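
A quick pure-Python check of these equivalences, using toy vectors and no library dependencies:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def l2_sq(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = normalize([0.3, -1.2, 0.7, 2.0])
b = normalize([1.1, 0.4, -0.2, 0.9])

# For unit-length vectors, cosine similarity equals the dot product...
assert abs(cosine(a, b) - dot(a, b)) < 1e-12
# ...and squared L2 distance is 2 - 2*cos(a, b), so all three metrics
# produce the same ranking of neighbors.
assert abs(l2_sq(a, b) - (2 - 2 * cosine(a, b))) < 1e-12
```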

| Metric | When to Use | Normalized Equivalent |
|---|---|---|
| Cosine similarity | Default for most embedding models | Same as dot product |
| Euclidean (L2) | Non-normalized vectors, spatial data | Same ranking as cosine when normalized |
| Dot product | Pre-normalized vectors, max-inner-product search | Same as cosine when normalized |

Practical rule: if using any standard embedding API (OpenAI, Cohere, Voyage, Jina, BGE), pick dot product or cosine — they’ll produce identical results because the API returns normalized vectors. Check your model’s documentation. Voyage v3 and Jina v3 both normalize by default. OpenAI’s text-embedding-3-large normalizes by default but supports a dimensions parameter that truncates before re-normalizing, so the shortened vectors remain unit-length.

HNSW: The Algorithm Behind Most Vector Databases

Hierarchical Navigable Small World (HNSW) is the default index type in Qdrant, Weaviate, pgvector (since v0.5.0), Chroma, and the index behind most Pinecone pod configurations. Understanding it is the single highest-leverage investment you can make in tuning vector search performance.

The Intuition

HNSW builds a multi-layer graph. The bottom layer (layer 0) contains every vector as a node, connected to its approximate nearest neighbors. Upper layers contain progressively fewer nodes — each node at layer l has a probability of appearing at layer l+1, typically 1/M where M is a parameter.
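
The layer-assignment rule can be sketched in a few lines. This mirrors the exponential-decay scheme hnswlib uses (level = floor(-ln(u) · mL) with mL = 1/ln(M)); the node counts are illustrative, not taken from any specific index:

```python
import math
import random

def assign_level(m: int, rng: random.Random) -> int:
    """Draw an insertion level: level l is reached with probability ~(1/M)^l."""
    m_l = 1.0 / math.log(m)  # level-generation factor, as in hnswlib
    return int(-math.log(rng.random()) * m_l)

rng = random.Random(42)
levels = [assign_level(16, rng) for _ in range(100_000)]

# The vast majority of nodes live only in layer 0; each higher layer
# holds roughly 1/M as many nodes as the one below it.
share_layer0_only = sum(1 for l in levels if l == 0) / len(levels)
print(f"{share_layer0_only:.2%} of nodes appear only in layer 0")
```

With M=16, about 15/16 ≈ 94% of nodes never leave layer 0, which is what keeps the upper layers sparse enough to act as express lanes.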

Search starts at the top layer, greedily navigating to the closest node to the query, then drops down a layer and repeats. The top layers provide “express lanes” across the graph; the bottom layer provides fine-grained local search.

[Diagram: layer 2 (~4 nodes, long-range jumps) → descend → layer 1 (~16 nodes, medium hops) → descend → layer 0 (all N nodes, local search) → top-k nearest neighbors.]

Search enters at the top layer and greedily descends, narrowing the search scope at each level.

The Two Parameters That Matter

M (max connections per node): Each node in layer 0 connects to up to 2*M neighbors; nodes in upper layers connect to up to M neighbors. Higher M means better recall and higher memory usage. Default in most implementations: 16. At M=16, each vector stores up to 32 neighbor links in layer 0 plus up to 16 per upper layer; with link IDs, reserved capacity, and allocator overhead, budget very roughly 0.25-1 KB of graph overhead per vector. At 10M vectors, that’s several GB of graph structure alone, on top of the raw vector data.

ef_construction (beam width during build): How many candidates the algorithm tracks while inserting a new node. Higher values produce a better graph but take longer to build. Default: 128-200. Diminishing returns above 400 for most workloads.

ef_search (beam width during query): The runtime equivalent. This is the primary latency/recall knob. At ef_search=50, queries are fast but recall might be 92%. At ef_search=200, recall hits 99% but latency doubles or triples. This is the parameter to tune at query time — it requires no index rebuild.

| Parameter | Default | Tuning Impact | Rebuild Required? |
|---|---|---|---|
| M | 16 | Memory ↑, recall ↑, build time ↑ | Yes |
| ef_construction | 128-200 | Build quality ↑, build time ↑ | Yes |
| ef_search | 100-150 | Recall ↑, latency ↑ | No |

Memory Footprint

For 1M vectors at 1536 dimensions (float32):

  • Raw vectors: 1M × 1536 × 4 bytes = ~5.7 GB
  • HNSW graph (M=16): ~1-1.5 GB
  • Total: ~7-7.2 GB

This is why HNSW is often called “memory-hungry.” The graph overhead is relatively modest, but the requirement to hold all vectors in memory (or memory-mapped storage) means costs scale linearly with collection size. At 100M vectors with 1536-d float32, that’s ~570 GB of raw vector data before graph overhead.
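
The arithmetic above generalizes into a small estimator. The ~1 KB-per-vector graph figure is a ballpark taken from this section, not an exact accounting for any specific engine:

```python
def hnsw_memory_gb(n_vectors: int, dims: int,
                   bytes_per_dim: int = 4,
                   graph_bytes_per_vector: int = 1024) -> dict:
    """Rough memory estimate for an in-memory HNSW index.

    Assumes float32 vectors (4 bytes/dim) and ~1 KB of graph overhead
    per vector at M=16 -- a ballpark, not an exact per-engine figure.
    """
    gb = 1024 ** 3
    raw = n_vectors * dims * bytes_per_dim / gb
    graph = n_vectors * graph_bytes_per_vector / gb
    return {"raw_vectors_gb": round(raw, 2),
            "graph_gb": round(graph, 2),
            "total_gb": round(raw + graph, 2)}

# 1M x 1536-d float32: ~5.7 GB raw vectors plus ~1 GB of graph
print(hnsw_memory_gb(1_000_000, 1536))
```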

Build Time

HNSW insertion is O(log n) per vector (traversing layers) times O(M × ef_construction) work per layer. For 10M 1536-d vectors on a modern 32-core machine, expect 30-90 minutes for a full index build. Incremental insertions are fast — single vectors insert in <1ms — but inserting many vectors into a pre-built graph produces a slightly worse graph than building from scratch. Most databases handle this transparently with background optimization passes.

IVF: Inverted File Indexes

IVF (Inverted File Index) partitions the vector space into nlist clusters using k-means, then assigns each vector to its nearest cluster centroid. At query time, only the nprobe closest clusters are searched.

[Diagram: query vector → distance calc against nlist centroids → select closest → brute-force search of the nprobe closest clusters → rank candidates and return top-k.]

IVF narrows the search space by comparing the query to cluster centroids, then only searching the most promising partitions.

Key Parameters

nlist (number of clusters): Typically sqrt(n) to 4*sqrt(n). For 1M vectors, nlist=1000 to 4000. Too few clusters → each cluster is large, search is slow. Too many → cluster boundaries cut through dense regions, hurting recall.

nprobe (clusters searched at query time): The recall/latency knob. nprobe=1 is fastest, nprobe=nlist is equivalent to brute force. Typical range: 5-50. At nprobe=10 with nlist=1024, you’re searching ~1% of the data.

IVF’s advantage over HNSW: lower memory overhead (no graph structure, just centroids and inverted lists) and faster index construction. Its disadvantage: generally lower recall at equivalent query latency. IVF also requires an upfront training step (k-means clustering), which is expensive for large datasets — O(nlist × n × d × iterations).
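
To make the mechanics concrete, here is a toy IVF-Flat sketch in numpy (a stand-in for what faiss.IndexIVFFlat does internally, not production code; the k-means trainer is deliberately naive):

```python
import numpy as np

rng = np.random.default_rng(0)

def train_centroids(x, nlist, iters=10):
    """Plain Lloyd's k-means: the upfront training step IVF requires."""
    centroids = x[rng.choice(len(x), nlist, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((x[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for c in range(nlist):
            members = x[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids

def ivf_search(q, x, centroids, lists, nprobe, k):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    order = np.argsort(((centroids - q) ** 2).sum(-1))[:nprobe]
    cand = np.concatenate([lists[c] for c in order])
    dists = ((x[cand] - q) ** 2).sum(-1)
    return cand[np.argsort(dists)[:k]]

x = rng.standard_normal((2000, 32)).astype(np.float32)
centroids = train_centroids(x, nlist=16)
assign = np.argmin(((x[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
lists = [np.flatnonzero(assign == c) for c in range(16)]

# Query near a known vector: with nprobe=4 we scan ~1/4 of the data
q = x[123] + 0.01 * rng.standard_normal(32).astype(np.float32)
print(ivf_search(q, x, centroids, lists, nprobe=4, k=5))
```

Raising nprobe widens the scan toward brute force; at nprobe=16 (all clusters here), recall is exact by construction.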

IVF is the backbone of FAISS (Meta’s library) and shows up as an option in most databases. It’s particularly useful when combined with quantization.

Product Quantization: Trading Precision for Memory

Product Quantization (PQ) compresses vectors by splitting each d-dimensional vector into m subvectors, then replacing each subvector with the index of its nearest centroid from a learned codebook. Instead of storing 1536 float32 values (6144 bytes), PQ with m=96 subquantizers and 8-bit codes stores 96 bytes per vector — a 64x compression.

[Diagram: original vector (1536 dims × 4 bytes = 6144 bytes) → partition into 96 subvectors of 16 dims each → replace each with a 1-byte codebook index → store a 96-byte PQ code, 64x smaller.]

Product quantization splits vectors into subvectors and replaces each with a codebook index, achieving 64x compression at the cost of distance approximation accuracy.

How Distance Computation Works with PQ

At query time, the database precomputes a lookup table: for each of the m subquantizers, it calculates the distance from the query’s subvector to all 256 centroids (for 8-bit codes). This produces an m × 256 table. Distance to any database vector is then approximated by summing m table lookups — extremely fast, especially with SIMD.

The approximation introduces error: PQ distance is a lossy estimate of the true distance. Reaching 95% recall with PQ typically requires 2-3x overretrieval followed by reranking with exact distances (stored separately or recomputed from original vectors).
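
A minimal sketch of this asymmetric distance computation (ADC). The codebooks here are random rather than trained by k-means, so the approximation is cruder than in a real index, but the table-lookup mechanics are the same:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, ks = 64, 8, 256  # dims, subquantizers, centroids per codebook
sub = d // m           # 8 dims per subvector

# Toy codebooks; in practice these come from k-means on training subvectors.
codebooks = rng.standard_normal((m, ks, sub)).astype(np.float32)

def encode(v):
    """Replace each subvector with the index of its nearest codebook centroid."""
    parts = v.reshape(m, sub)
    return np.array([np.argmin(((codebooks[j] - parts[j]) ** 2).sum(-1))
                     for j in range(m)], dtype=np.uint8)

def adc_distance(query, code):
    """Approximate squared L2 distance: sum of m precomputed table lookups."""
    parts = query.reshape(m, sub)
    # m x 256 lookup table, built once per query, then reused for every code
    table = ((codebooks - parts[:, None, :]) ** 2).sum(-1)
    return sum(table[j, code[j]] for j in range(m))

v = rng.standard_normal(d).astype(np.float32)
q = rng.standard_normal(d).astype(np.float32)
code = encode(v)  # 8 bytes instead of 256 bytes of float32

print(f"exact {((q - v) ** 2).sum():.1f}  vs  PQ approx {adc_distance(q, code):.1f}")
```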

Scalar Quantization

A simpler alternative: scalar quantization (SQ) reduces each float32 to int8, yielding a 4x compression with much less distortion than PQ. Qdrant, Weaviate, and pgvector all support SQ. At 1536 dimensions, SQ reduces per-vector storage from 6144 bytes to 1536 bytes. The recall loss is typically <1% for well-distributed embeddings.

Binary quantization goes further — 1 bit per dimension. A 1536-d vector becomes 192 bytes. Useful for first-pass candidate retrieval with reranking, but standalone recall drops to 70-85% for typical embedding distributions.
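
A sketch of both schemes in numpy. Real engines fit the quantization range per segment and usually clip at a quantile (as in Qdrant's quantile: 0.99 setting); here we use the raw min/max for simplicity:

```python
import numpy as np

rng = np.random.default_rng(2)
vecs = rng.standard_normal((1000, 1536)).astype(np.float32)

# Scalar quantization: map each float32 to int8 with a shared affine scale.
lo, hi = float(vecs.min()), float(vecs.max())
scale = (hi - lo) / 255.0
q = np.clip(np.round((vecs - lo) / scale) - 128, -128, 127).astype(np.int8)

# Dequantize to measure distortion: error is bounded by half a quantization step.
deq = (q.astype(np.float32) + 128) * scale + lo
print("max abs error:", float(np.abs(vecs - deq).max()))

# Binary quantization: 1 bit per dimension (sign), 1536 dims -> 192 bytes.
bits = np.packbits((vecs > 0).astype(np.uint8), axis=1)
print("bytes per vector:", bits.shape[1])
```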

| Method | Compression | Memory (1M × 1536d) | Typical Recall Loss |
|---|---|---|---|
| None (float32) | 1x | 5.7 GB | 0% |
| Scalar (int8) | 4x | 1.4 GB | <1% |
| Product Quantization | 32-64x | ~96-192 MB | 3-8% without reranking |
| Binary | 32x | ~192 MB | 15-30% without reranking |

Combining Techniques: IVF-PQ, HNSW-PQ, and Beyond

Real-world indexes combine algorithms. The most common compositions:

IVF-PQ: IVF for coarse partitioning, PQ for compressed storage within each partition. This is FAISS’s bread-and-butter index (faiss.IndexIVFPQ). At 1B vectors, IVF-PQ can fit in ~100GB RAM with reasonable recall. The workflow: find the nearest nprobe centroids, scan PQ codes within those clusters, rerank top candidates with exact vectors.

HNSW + SQ/PQ: Use HNSW graph navigation with quantized vectors for distance computation during search, then optionally rerank with full-precision vectors. Qdrant calls this “quantization” in their configuration; Weaviate supports PQ compression on HNSW indexes.

IVF-HNSW: Use HNSW to find the nearest centroids (replacing the brute-force centroid search in standard IVF). Useful when nlist is very large (>100k).

[Diagram: query → coarse search (HNSW graph or IVF centroids) → PQ distance scan over compressed vectors (overretrieve candidates) → rerank top-100 with full-precision exact distances → final top-k.]

Composite indexes use coarse search to find candidate partitions, PQ for fast approximate scoring, and optional reranking with full-precision vectors for final accuracy.

The practical takeaway: if your dataset fits in memory with float32 or int8 vectors, pure HNSW with scalar quantization is the simplest path. When memory is the constraint — datasets exceeding available RAM — IVF-PQ or HNSW+PQ becomes necessary.

Flat Search and Brute Force: When Simplicity Wins

For collections under ~100k vectors, brute-force search with SIMD-optimized distance computation often outperforms ANN indexes. A 50k × 1536 float32 matrix fits in ~290 MB. Brute-force scan with AVX-512 on a modern CPU completes in <5ms. The recall is 100%, there’s no index to build or tune, and insertions/deletions are trivial.

pgvector’s default behavior without an index is brute-force scan. For small collections, this is the correct choice. Adding an HNSW index to a 10k-row table adds build time, memory, and configuration surface area with negligible latency improvement.

The crossover point depends on dimensionality and hardware, but as a rough guide:

| Collection Size | Recommended Approach |
|---|---|
| <50k vectors | Brute force (no index) |
| 50k-1M vectors | HNSW with scalar quantization |
| 1M-100M vectors | HNSW + SQ, or IVF-PQ |
| >100M vectors | IVF-PQ with HNSW coarse quantizer, or distributed sharding |

Filtering: The Hard Part Nobody Warns You About

Most real queries aren’t pure vector search. They include metadata filters: “find similar documents where tenant_id = ‘acme’ and created_after = ‘2025-01-01’.” This is where vector databases diverge sharply in capability and where performance often collapses.

Three Filtering Strategies

Pre-filtering: Apply metadata filters first, then search only matching vectors. Guarantees that all results match the filter. Problem: if the filter is very selective (matches 0.1% of vectors), the ANN index can’t be used efficiently — the graph or cluster structure was built over the full dataset, and the filter-matching vectors might be scattered across the graph.

Post-filtering: Run ANN search first, retrieve more candidates than needed, then filter. Fast, but if the filter is selective, most candidates are discarded. At 0.1% filter selectivity, retrieving top-10 results might require scanning top-10,000 before finding 10 matches.

Integrated filtering: Modify the ANN search to respect filters during traversal. HNSW graph traversal skips non-matching nodes. This is the approach Qdrant and Weaviate use. It preserves index efficiency but can degrade recall when filters are very selective — the graph’s connectivity assumptions break down when most nodes are invisible.

[Diagram: pre-filter (filter → search subset): exact filter match, but slow on selective filters. Post-filter (search → filter results): fast search, but may miss results. Integrated (filter during traversal): best balance, but recall degrades at high selectivity.]

Each filtering strategy has a failure mode — the choice depends on filter selectivity and acceptable recall loss.

The Selectivity Cliff

When a metadata filter matches less than ~1-5% of the dataset, all three strategies struggle. Pre-filtering produces a tiny search set where ANN indexes add no value. Post-filtering wastes compute. Integrated filtering navigates a sparse graph.
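
The post-filtering arithmetic behind the cliff is worth making explicit. Under the simplifying assumption that filter matches are spread uniformly through the ANN ranking:

```python
def expected_overfetch(k: int, selectivity: float) -> int:
    """Candidates an ANN pass must return so that ~k survive a post-filter,
    assuming matches are distributed uniformly through the ranking."""
    return int(k / selectivity)

for s in (0.5, 0.1, 0.01, 0.001):
    print(f"selectivity {s:>6.1%}: fetch ~{expected_overfetch(10, s):,} "
          f"candidates to keep a top-10")
```

At 0.1% selectivity the ANN pass must return ~10,000 candidates to fill a top-10 result set, which is why per-partition indexes win for highly selective filters.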

The practical solution for highly selective filters: maintain separate indexes per partition. If tenant_id produces natural partitions of 10k-1M vectors each, building per-tenant indexes avoids the selectivity problem entirely. Pinecone’s “namespaces” and Qdrant’s “payload indexes” with partitioning address this. Weaviate added multi-tenancy with per-tenant isolation in 2024, which maps well to this pattern.

Database Comparison

Pinecone

Architecture: Fully managed, serverless and pod-based tiers. The serverless tier (launched 2024) charges per read/write/storage with no idle cost. Pod-based tier (p1, p2, s1 pods) offers more predictable performance.

Index type: Proprietary. Pinecone hasn’t publicly documented whether they use HNSW, IVF, or something custom in serverless. Pod-based likely uses a modified HNSW. The “sparse-dense” hybrid search suggests separate sparse and dense indexes with score fusion.

Filtering: Integrated metadata filtering. Supports exact match, range, and $in operators. Performance degrades on highly selective filters but is better than naive post-filtering.

Strengths: Zero operational overhead. The serverless pricing model works well for bursty workloads. Sparse-dense hybrid search is built in.

Weaknesses: No self-hosted option. No access to index parameters (M, ef_search) — Pinecone manages these. Vendor lock-in is real. Debugging recall issues requires support tickets, not parameter tuning.

Pricing (serverless, as of early 2026): ~$0.033/1M read units, $2.00/1M write units, $0.033/GB/month storage. A 10M-vector collection with 1536-d vectors costs roughly $20-40/month in storage plus per-query costs.

Qdrant

Architecture: Written in Rust. Runs as a single binary or distributed cluster. Open-source (Apache 2.0) with a managed cloud offering.

Index type: HNSW with configurable M and ef_construction. Supports scalar quantization (int8), product quantization, and binary quantization. Quantization can be configured per-collection.

Filtering: Payload indexes (analogous to secondary indexes in traditional databases) with integrated filtering during HNSW traversal. Supports numeric, keyword, geo, datetime, and full-text payload types. The filtering implementation is notably well-engineered — Qdrant maintains a payload index that integrates with the HNSW traversal, avoiding the worst-case selectivity problems.

Strengths: Best-in-class filtering performance. Fine-grained tuning (M, ef, quantization). Multi-tenancy support. The gRPC API is fast. Snapshot-based backups.

Weaknesses: Distributed mode requires manual shard management (improved in recent versions but not as turnkey as Pinecone). Memory-mapped storage helps with large datasets but introduces I/O latency.

Pricing (cloud): Starts at ~$25/month for a small node. Self-hosted: free.

from qdrant_client import QdrantClient
from qdrant_client.models import (
    VectorParams, Distance, Filter, FieldCondition, MatchValue,
    ScalarQuantization, ScalarQuantizationConfig, ScalarType,
    HnswConfigDiff, SearchParams,
)

client = QdrantClient("localhost", port=6333)

# Create collection with scalar quantization
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(type=ScalarType.INT8, quantile=0.99, always_ram=True)
    ),
    hnsw_config=HnswConfigDiff(m=16, ef_construct=128),
)

# Search with metadata filter
results = client.query_points(
    collection_name="documents",
    query=[0.1] * 1536,  # query vector
    query_filter=Filter(must=[FieldCondition(key="tenant_id", match=MatchValue(value="acme"))]),
    limit=10,
    search_params=SearchParams(hnsw_ef=128),  # runtime ef_search
)

Weaviate

Architecture: Written in Go. Modular: separate vectorizer modules can call OpenAI, Cohere, etc. at ingest time. Open-source (BSD-3) with a managed cloud.

Index type: HNSW (custom implementation). Added PQ compression in 2024. Flat index option for small collections.

Filtering: Inverted index on properties combined with HNSW search. Weaviate builds a roaring bitmap per filterable property and intersects it with the HNSW candidate set. Pre-filtering is the default; it switches to post-filtering when the filter is too selective and pre-filtering would be too slow.

Strengths: Built-in vectorization (no need to manage embedding API calls separately). GraphQL and REST APIs. Multi-tenancy is well-implemented — each tenant gets isolated storage and index resources. Hybrid search (BM25 + vector) is native.

Weaknesses: Go’s garbage collector can cause latency spikes under heavy load (measurably worse p99 than Qdrant’s Rust-based engine at high QPS). The module system adds complexity. Memory consumption tends to be higher than Qdrant for equivalent workloads.

Pricing (cloud): Serverless tier starts free with pay-per-use. Dedicated clusters from ~$25/month.

pgvector

Architecture: PostgreSQL extension. Vectors are stored as a column type. Indexes are built using standard CREATE INDEX syntax.

Index type: HNSW (added in v0.5.0, mid-2023) and IVFFlat. HNSW is the recommended choice for most workloads. As of pgvector 0.8.x (current), the extension supports half-precision storage via the halfvec type and binary vectors via the bit type, which serve as its compression options.

Filtering: This is pgvector’s killer feature. Metadata filtering uses PostgreSQL’s query planner — WHERE tenant_id = 'acme' combined with ORDER BY embedding <=> query_vector LIMIT 10 leverages PostgreSQL’s ability to combine B-tree and HNSW indexes. Partial indexes work: CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops) WHERE tenant_id = 'acme'.

Strengths: No new infrastructure. If you’re already running PostgreSQL, pgvector adds vector search with a CREATE EXTENSION command. Full SQL expressiveness for filtering. Transactional consistency with your application data. Partial indexes for per-tenant isolation.

Weaknesses: Performance ceiling. pgvector’s HNSW implementation is single-threaded for index builds (parallel build landed in 0.8.0 but is still slower than Qdrant/FAISS). At >5M vectors, query latency and memory usage become painful compared to purpose-built engines. No built-in replication of vector indexes across read replicas (the index must be built on each replica). No native quantization beyond halfvec (half-precision float16) and bit types.

Pricing: Whatever your PostgreSQL host costs. On AWS RDS, a db.r7g.xlarge (32 GB RAM) runs ~$350/month and comfortably handles 2-5M 1536-d vectors.

-- pgvector: create table, index, and query with filtering
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id bigserial PRIMARY KEY,
    tenant_id text NOT NULL,
    content text,
    embedding vector(1536),
    created_at timestamptz DEFAULT now()
);

-- B-tree index for filtering
CREATE INDEX idx_tenant ON documents (tenant_id);

-- HNSW index with custom parameters
CREATE INDEX idx_embedding ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 128);

-- Query: combined metadata filter + vector search
SET hnsw.ef_search = 100;

SELECT id, content, embedding <=> $1::vector AS distance
FROM documents
WHERE tenant_id = 'acme' AND created_at > '2025-01-01'
ORDER BY embedding <=> $1::vector
LIMIT 10;

Chroma

Architecture: Python-native. Uses hnswlib under the hood for vector indexing and DuckDB+Parquet or SQLite for metadata. Open-source (Apache 2.0). Designed as an “AI-native” embedding database.

Index type: HNSW via hnswlib. Limited configurability — M and ef_construction are exposed but the defaults are reasonable for small-to-medium collections.

Filtering: Metadata filtering via where clauses. Implemented as post-filtering on the HNSW results. At selective filters, this means poor recall or requiring large overretrieval.

Strengths: Simplest possible API. Five lines to ingest and query. Excellent for prototyping, local development, and small-scale production (<1M vectors). In-process mode (no server needed) is genuinely useful for notebook workflows and testing.

Weaknesses: Not designed for large-scale production. No built-in clustering, replication, or sharding. Filtering is post-filter only, which breaks at scale. hnswlib’s index is fully in-memory with no memory-mapped option. The Python process becomes the bottleneck at high QPS.

Pricing: Free (open-source). Chroma Cloud (hosted, in GA as of late 2025) has a free tier and paid plans starting at ~$30/month.

Comparison Table

| Feature | Pinecone | Qdrant | Weaviate | pgvector | Chroma |
|---|---|---|---|---|---|
| Language | Proprietary | Rust | Go | C (PG extension) | Python |
| Self-hosted | No | Yes | Yes | Yes | Yes |
| Max practical scale | Billions | 100M+ | 100M+ | ~5-10M | ~1M |
| HNSW tuning | No | Full | Limited | Full | Limited |
| Quantization | Unknown | SQ, PQ, binary | PQ, BQ | halfvec, bit | None |
| Filtering quality | Good | Excellent | Good | Excellent (SQL) | Poor |
| Hybrid search | Yes (sparse-dense) | Yes (sparse vectors) | Yes (BM25 + vector) | Manual (tsvector + vector) | No |
| Multi-tenancy | Namespaces | Payload partitions | Native tenant isolation | Partial indexes | Collections |
| Managed cloud | Yes (only) | Yes | Yes | RDS/Aurora/Supabase | Yes |
| Minimum cost | ~$0/mo (serverless idle) | ~$25/mo (cloud) | Free tier (cloud) | Existing PG cost | Free |
[Diagram: prototype (<100k vectors) → Chroma or pgvector; small production (100k-5M vectors) → Qdrant, pgvector, or Pinecone; large production (5M-100M+ vectors) → Qdrant, Weaviate, or Pinecone.]

Database choice depends primarily on scale and whether you’re already running PostgreSQL.

Architecture Patterns in Production

Pattern 1: pgvector for Everything Under 5M Vectors

If the application already uses PostgreSQL and the vector collection stays under 5M rows, pgvector eliminates an entire infrastructure dependency. Vectors live alongside relational data. JOINs work. Transactions are ACID. The developer experience is “just SQL.”

[Diagram: application → SQL queries → PostgreSQL + pgvector (vectors + metadata + app data); application → embed text → Embedding API (OpenAI, Cohere) → store vectors.]

The simplest production architecture: PostgreSQL handles both application data and vector search.

The failure mode: query latency creeps up as the collection grows. HNSW index build times (15-45 minutes for 5M 1536-d vectors) block vacuum operations. At this point, extract vector search into a dedicated engine.

Pattern 2: Dedicated Vector DB + Application DB

The standard pattern at scale. Application data stays in PostgreSQL/MySQL. Vectors and associated metadata live in Qdrant, Weaviate, or Pinecone. The application writes to both on ingest and queries the vector DB for retrieval.

[Diagram: application → CRUD → PostgreSQL (app data, full docs); application → embed → Embedding API → upsert → Qdrant / Pinecone (vectors + metadata); application → search → vector DB → fetch full docs by ID from PostgreSQL.]

Separating vector search from the application database introduces consistency challenges but enables independent scaling.

The consistency challenge: if a document is updated in the application DB but the vector DB hasn’t been re-indexed yet, search returns stale results. Solutions: (1) synchronous dual-write (slow, fragile), (2) change data capture (CDC) pipeline from the application DB to trigger re-embedding and vector upsert, or (3) accept eventual consistency with a bounded lag (the most common choice).

Pattern 3: Hybrid Search with Reranking

Pure vector search misses exact keyword matches. Pure keyword search misses semantic similarity. Hybrid search runs both and merges results.

[Diagram: user query → sparse search (BM25 / keyword) and dense search (vector similarity) → score fusion (RRF or weighted) over merged candidates → cross-encoder reranker → final top-k results.]

Hybrid search combines keyword and vector retrieval, then uses a reranker to produce the final ranking.

Reciprocal Rank Fusion (RRF) is the simplest merge strategy: score = Σ 1/(k + rank_i) across result lists, where k is typically 60. It works well without tuning. Weighted linear combination requires calibrating the relative weights of sparse vs. dense scores, which vary by query type.
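
RRF is short enough to implement directly; this sketch fuses any number of ranked lists (the document IDs are illustrative):

```python
def rrf_fuse(result_lists, k=60, top_n=10):
    """Reciprocal Rank Fusion: score(doc) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

bm25 = ["d3", "d1", "d7", "d5"]   # keyword ranking
dense = ["d1", "d9", "d3", "d2"]  # vector ranking

# Documents appearing high in both lists ("d1", "d3") rise to the top
print(rrf_fuse([bm25, dense]))
```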

The reranker matters more than the merge strategy. A cross-encoder reranker (Cohere Rerank v3, Jina Reranker v2, or a fine-tuned bge-reranker-v2-m3) evaluates query-document pairs jointly. Rerankers are expensive — 10-50ms per candidate pair — so they’re applied to the merged top-50 or top-100 candidates.

Weaviate supports hybrid search natively (alpha parameter controls BM25 vs. vector weighting). Pinecone supports sparse-dense vectors in the same index. For Qdrant and pgvector, hybrid search requires running two queries and merging in application code.

Pattern 4: Tiered Storage for Cost Optimization

At 100M+ vectors, keeping everything in memory-backed HNSW is expensive. A tiered approach: hot data (recent, frequently accessed) in a fast HNSW index, cold data in an IVF-PQ or disk-backed index.

Qdrant supports on-disk storage with memory-mapped files. Queries to memory-mapped indexes are ~2-5x slower than in-memory but cost 10x less (disk vs. RAM). The configurable on_disk flag per collection and memmap_threshold parameter control when data moves to disk.
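
A sketch of what a cold-tier collection can look like with Qdrant's Python client. The collection name and threshold value are illustrative; memmap_threshold is measured in kilobytes:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, OptimizersConfigDiff

client = QdrantClient("localhost", port=6333)

# Cold tier: original vectors live in memory-mapped files on disk, and
# segments larger than memmap_threshold (KB) are converted to mmap storage.
client.create_collection(
    collection_name="documents_cold",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE, on_disk=True),
    optimizers_config=OptimizersConfigDiff(memmap_threshold=20000),
)
```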

Benchmarking Your Own Workload

Public benchmarks (ANN Benchmarks, VectorDBBench) test synthetic workloads with uniform distributions. Real embeddings from production data have different clustering properties, and real queries include filters. Running benchmarks on your own data with your own query patterns is non-negotiable before choosing a database.

A Practical Benchmarking Protocol

  1. Sample 100k-1M vectors from your actual data (or all of it if smaller). Include representative metadata fields.
  2. Generate ground truth: brute-force top-100 for 1000 random query vectors. This is your recall baseline.
  3. Measure recall@k: For each database/configuration, run the same 1000 queries and compute |intersection(ann_results, ground_truth)| / k. Measure at k=10 and k=100.
  4. Measure with filters: Repeat with metadata filters at 50%, 10%, 1%, and 0.1% selectivity. This is where databases diverge most.
  5. Measure under load: Run queries at 10, 50, 100, 500 QPS. Measure p50, p95, p99 latency. Purpose-built databases (Qdrant, Weaviate) should sustain <20ms p95 at 100 QPS for 1M vectors; pgvector will likely show higher p99.
  6. Measure index build time and memory: Track RSS memory and wall-clock build time. These determine infrastructure cost.
import numpy as np

def measure_recall(ground_truth_ids, ann_result_ids, k):
    """Compute recall@k: fraction of true top-k neighbors found by ANN."""
    recalls = []
    for gt, ann in zip(ground_truth_ids, ann_result_ids):
        gt_set = set(gt[:k])
        ann_set = set(ann[:k])
        recalls.append(len(gt_set & ann_set) / k)
    return np.mean(recalls)

# Example: brute-force ground truth with numpy
def brute_force_topk(query_vectors, db_vectors, k=100):
    """Compute exact top-k for recall measurement."""
    # For cosine similarity with normalized vectors, use dot product
    similarities = query_vectors @ db_vectors.T  # (n_queries, n_db)
    top_k_indices = np.argpartition(-similarities, kth=k, axis=1)[:, :k]
    # Sort the top-k by similarity
    for i in range(len(query_vectors)):
        order = np.argsort(-similarities[i, top_k_indices[i]])
        top_k_indices[i] = top_k_indices[i, order]
    return top_k_indices

What “Good” Looks Like

For a RAG system feeding results to an LLM: recall@10 ≥ 0.95 is the floor. Below that, the LLM frequently misses relevant context. p95 query latency under 50ms keeps the retrieval step from dominating the total response time (LLM inference is 500-3000ms).

For recommendation systems with user-facing latency requirements: p99 < 20ms at the expected QPS, with recall@50 ≥ 0.90 (the downstream ranking model compensates for retrieval misses).

Summary

HNSW is the default index for most workloads. Tune ef_search at runtime for recall/latency tradeoff; tune M and ef_construction at build time for recall/memory tradeoff.

IVF-PQ is for scale beyond available RAM. 64x vector compression makes billion-vector search feasible on commodity hardware, at the cost of recall that requires reranking to compensate.

Scalar quantization (int8) is the lowest-risk compression — <1% recall loss, 4x memory savings. Use it by default on any collection over 1M vectors.

Filtering is the most common source of production issues. Test with realistic filter selectivity. Per-tenant partitioning (separate indexes or namespaces) avoids the selectivity cliff.

pgvector is the right answer when you have <5M vectors and already run PostgreSQL. The operational simplicity outweighs the performance gap for most applications at this scale.

Qdrant has the strongest combination of performance, tunability, and filtering quality among the purpose-built open-source options. Weaviate is a solid choice when native hybrid search and built-in vectorization matter. Pinecone makes sense when zero operational burden is worth the cost premium and lock-in.

Chroma is a prototyping tool. Using it in production beyond small-scale deployments creates scaling problems that purpose-built databases solved years ago.

Benchmark on your own data with your own filters. Public benchmarks test conditions that probably don’t match your workload.
