Office Hours — How do you handle data isolation and prevent information leakage in RAG systems serving multiple tenants?


Multi-tenant RAG is a deceptively hard problem because isolation failures cascade silently. Your retrieval system becomes a cross-tenant data exfiltration vector the moment tenant A’s query can surface tenant B’s documents. Most teams discover this during a compliance audit or incident response, not before production.

The Core Problem: Retrieval Doesn’t Respect Boundaries

In a single-tenant RAG system, you index documents and retrieve by semantic similarity. Add tenants and that simple picture collapses. A vector embedding from tenant A’s query can match tenant B’s documents in embedding space purely by accident. A user asking about “cloud infrastructure costs” might retrieve another company’s confidential financial disclosures just because the semantic distance is small.

The retrieval layer doesn’t natively understand tenancy. It sees vectors and distances. You have to bolt isolation on top.

Filtering at Query Time: The Straightforward Approach

The simplest pattern is metadata filtering. Store a tenant_id alongside every document in your vector database, then filter results to only that tenant before returning them.

# Pseudocode for Pinecone or similar
query_embedding = encode(user_query)
results = vector_db.query(
    vector=query_embedding,
    top_k=10,
    filter={"tenant_id": {"$eq": current_tenant_id}}
)

This works and is easy to reason about. The catch: it shifts the burden to your vector database. If your vector store can’t apply metadata filters with reasonable latency, the approach stalls. Pinecone, Weaviate, and Milvus all support filtered queries. A bare FAISS index stores only vectors and IDs and has no metadata filter of this kind, so you’d have to filter in the application: over-fetch (say, the top 100), then keep only the current tenant’s documents. If that tenant’s documents aren’t among the candidates, you return fewer results than requested, or none.
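The over-fetch-then-filter fallback is easy to get subtly wrong, so here is a minimal sketch of it in plain Python. Everything in it is an assumption for illustration: the `Doc` class, the `search_for_tenant` function, and the brute-force cosine ranking stand in for whatever unfiltered index you actually use.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    tenant_id: str
    vector: tuple[float, ...]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_for_tenant(docs, query_vec, tenant_id, top_k=3, overfetch=100):
    # Stage 1: tenant-blind ranking, as an index without metadata filters would do.
    ranked = sorted(docs, key=lambda d: cosine(d.vector, query_vec), reverse=True)
    candidates = ranked[:overfetch]
    # Stage 2: keep only the caller's documents before anything leaves this function.
    own = [d for d in candidates if d.tenant_id == tenant_id]
    return own[:top_k]


docs = [
    Doc("a1", "tenant_a", (1.0, 0.0)),
    Doc("b1", "tenant_b", (0.9, 0.1)),
    Doc("b2", "tenant_b", (0.8, 0.2)),
    Doc("a2", "tenant_a", (0.0, 1.0)),
]

# A wide enough over-fetch finds both of tenant_a's documents.
hits = search_for_tenant(docs, (1.0, 0.0), "tenant_a", top_k=2, overfetch=4)
# A narrow over-fetch is crowded out by tenant_b's closer vectors.
starved = search_for_tenant(docs, (1.0, 0.0), "tenant_a", top_k=2, overfetch=3)
```

The second call shows the weakness of this pattern. The filter never leaks, but a tenant whose documents rank below other tenants’ in the global list can get fewer results than it asked for.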

The real risk here is incomplete filtering. If your metadata filtering is broken, misconfigured, or bypassed anywhere in your stack, you leak data horizontally. One missing filter check and another tenant sees everything.

Isolation Through Separate Vector Indexes

A stronger isolation model is to give each tenant their own vector index entirely. Tenant A gets one Pinecone index, tenant B gets another. Queries only ever hit the right index because routing happens at the application layer.

# Route to tenant-specific index
tenant_vector_db = vector_stores[current_tenant_id]
results = tenant_vector_db.query(vector=query_embedding, top_k=10)

Advantages: Vector database misconfiguration can’t leak across tenants because there’s no “across tenants” state to misconfigure. If Pinecone has a bug in its filtering logic, it doesn’t matter—you’re not using filters.

Cost tradeoff: You pay for separate indexes. Where pricing is per index or per provisioned capacity, 1,000 tenants means 1,000 billable indexes, even if most are tiny. Vector databases are addressing this with cheaper namespaces or partitions inside a single index, but the economics still matter. For a startup with many small tenants, separate indexes can cost roughly 5-10x more than a single filtered index, depending on the provider’s pricing model.
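Index-per-tenant isolation is only as good as the routing in front of it. A rough sketch of a router that refuses to guess is below. `TenantIndexRouter`, `UnknownTenantError`, and `InMemoryIndex` are hypothetical names, and the in-memory index is a stand-in for a real per-tenant client.

```python
class UnknownTenantError(LookupError):
    """Raised instead of falling back to any default or shared index."""


class TenantIndexRouter:
    def __init__(self):
        self._indexes = {}

    def register(self, tenant_id, index):
        # Refuse silent overwrites so two tenants can never share a slot.
        if tenant_id in self._indexes:
            raise ValueError(f"index already registered for {tenant_id!r}")
        self._indexes[tenant_id] = index

    def index_for(self, tenant_id):
        try:
            return self._indexes[tenant_id]
        except KeyError:
            # No fallback: an unrecognized tenant gets an error, not someone else's data.
            raise UnknownTenantError(tenant_id) from None


class InMemoryIndex:
    """Stand-in for a real per-tenant index client (assumption for this sketch)."""

    def __init__(self, doc_ids):
        self._doc_ids = list(doc_ids)

    def query(self, vector, top_k):
        # A real client would rank by similarity; this just returns stored IDs.
        return self._doc_ids[:top_k]


router = TenantIndexRouter()
router.register("tenant_a", InMemoryIndex(["a1", "a2"]))
router.register("tenant_b", InMemoryIndex(["b1"]))

results = router.index_for("tenant_a").query(vector=[0.1, 0.2], top_k=10)
```

The design choice worth copying is the missing default. A plain dictionary lookup with a fallback value would reintroduce the shared state that separate indexes were meant to remove.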

Encryption + Filtering: Defense in Depth

For compliance-sensitive workloads (healthcare, finance), some teams layer encryption on top of filtering. Documents are encrypted with tenant-specific keys before indexing. Even if filtering fails, the retrieved document is unreadable without the right key.

This is expensive operationally. You need to manage key rotation, key storage, and decryption latency. It also complicates semantic search: you can’t embed ciphertext, so you embed the plaintext, encrypt the document text separately, and store the vector alongside the encrypted text. The vectors themselves stay unencrypted, so this protects document contents rather than the index. Most teams skip this unless they have explicit regulatory requirements.

The Embedding Model as an Isolation Surface

A subtler attack surface: if your embedding model encodes tenant context implicitly, you might be leaking information in the vector space itself. An adversary probing your vector database with carefully crafted queries could potentially infer information about other tenants’ documents even with filtering in place.

This is theoretical for most workloads but real if you’re in a high-stakes environment. The mitigation is to ensure your embedding model is genuinely tenant-agnostic—it shouldn’t be fine-tuned on a mix of tenant data or trained with task-specific signals that correlate with tenant identity.

Operational Isolation Patterns

Beyond the retrieval layer itself:

Separate LLM calls per tenant. Don’t batch queries from multiple tenants into a single LLM call. If tenant A’s query and tenant B’s query go to GPT-5.5 in the same prompt, the model can draw on one tenant’s retrieved context when answering the other. Use separate API calls, separate prompts, separate conversation histories.

Audit retrieval logs. Log what documents were retrieved for every query, along with the tenant ID. This is your forensic trail if something goes wrong. Tools like Datadog or custom audit tables let you answer “did tenant A ever retrieve tenant B’s documents?” after the fact.

Test cross-tenant queries. In your test suite, simulate a tenant A user trying to query with terms related to tenant B’s content. Verify that no cross-tenant documents are returned. Make this a regression test you run before every deployment.
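As one way to make the audit trail concrete, here is a small sketch using an in-memory SQLite database. The table names, column names, and the `log_retrieval` and `cross_tenant_retrievals` helpers are all assumptions for illustration, and a production system would write to durable storage instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE documents (
        doc_id TEXT PRIMARY KEY,
        owner_tenant_id TEXT NOT NULL
    );
    CREATE TABLE retrieval_audit (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        logged_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        requesting_tenant_id TEXT NOT NULL,
        query_text TEXT NOT NULL,
        doc_id TEXT NOT NULL
    );
    """
)


def log_retrieval(conn, tenant_id, query_text, doc_ids):
    """Write one audit row per retrieved document."""
    conn.executemany(
        "INSERT INTO retrieval_audit (requesting_tenant_id, query_text, doc_id) "
        "VALUES (?, ?, ?)",
        [(tenant_id, query_text, doc_id) for doc_id in doc_ids],
    )
    conn.commit()


def cross_tenant_retrievals(conn):
    """Return (requesting tenant, owning tenant, doc_id) for every leaked document."""
    return conn.execute(
        """
        SELECT a.requesting_tenant_id, d.owner_tenant_id, a.doc_id
        FROM retrieval_audit AS a
        JOIN documents AS d ON d.doc_id = a.doc_id
        WHERE d.owner_tenant_id <> a.requesting_tenant_id
        ORDER BY a.id
        """
    ).fetchall()


conn.executemany(
    "INSERT INTO documents VALUES (?, ?)",
    [("a1", "tenant_a"), ("b1", "tenant_b")],
)
log_retrieval(conn, "tenant_a", "cloud infrastructure costs", ["a1"])
# Simulated isolation failure: tenant_a is handed a tenant_b document.
log_retrieval(conn, "tenant_a", "cloud infrastructure costs", ["b1"])

leaks = cross_tenant_retrievals(conn)
```

The same query can run as a scheduled check or as an assertion at the end of a cross-tenant test run, where an empty result is the pass condition. Note that the query text itself may be sensitive, so the audit table deserves the same access controls as the documents.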

Real-World Failure Mode

A team deployed a multi-tenant RAG system using metadata filtering in Pinecone. Months in, they realized their frontend wasn’t passing the tenant_id to the backend in certain edge cases (concurrent requests, fallback paths). Those queries ran against the shared index with no tenant filter and returned documents from other tenants. They caught it during a security audit, not from user reports.

The fix: move filtering deeper. Don’t trust the frontend to pass tenant context correctly. Derive it server-side from the authenticated user. Make the vector database query fail loudly if no valid tenant_id is present. Add integration tests that actually verify cross-tenant isolation, not just assume it works.
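A minimal sketch of that fix, under stated assumptions: `AuthenticatedUser` represents whatever your auth layer produces after verifying a session or token, `FakeVectorDB` is an in-memory stand-in for the real client, and `tenant_filter` and `retrieve` are hypothetical helper names. The filter shape mirrors the query shown earlier.

```python
from dataclasses import dataclass
from typing import Optional


class MissingTenantError(PermissionError):
    """Raised when a request reaches retrieval without a tenant."""


@dataclass(frozen=True)
class AuthenticatedUser:
    user_id: str
    tenant_id: Optional[str]  # set by the auth layer, never by request input


def tenant_filter(user):
    # Fail loudly: no tenant means no query, rather than an unfiltered query.
    if not user.tenant_id:
        raise MissingTenantError(f"user {user.user_id!r} has no tenant")
    return {"tenant_id": {"$eq": user.tenant_id}}


class FakeVectorDB:
    """In-memory stand-in that only understands the $eq tenant filter above."""

    def __init__(self, rows):
        self._rows = rows  # list of (doc_id, tenant_id) pairs

    def query(self, vector, top_k, filter):
        # `filter` has no default, so it cannot be omitted by accident.
        wanted = filter["tenant_id"]["$eq"]
        return [doc_id for doc_id, owner in self._rows if owner == wanted][:top_k]


def retrieve(db, user, query_embedding, top_k=10):
    # The only path to the database builds its filter from the authenticated user.
    return db.query(vector=query_embedding, top_k=top_k, filter=tenant_filter(user))


db = FakeVectorDB([("a1", "tenant_a"), ("b1", "tenant_b")])
alice = AuthenticatedUser(user_id="alice", tenant_id="tenant_a")
results = retrieve(db, alice, [0.1, 0.2])
```

An integration test in the same spirit would call `retrieve` with a user whose `tenant_id` is `None` and assert that `MissingTenantError` is raised, then assert that no returned document belongs to another tenant.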

Cost and Performance Reality

Separate indexes: Scales linearly with tenant count. 10 tenants = 10x the index cost. Not viable at scale unless you have a small number of large tenants.

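Metadata filtering with vector databases: Sublinear cost scaling but introduces latency variability. How filter selectivity affects latency depends on how the database implements filtered search: some engines are fastest on narrow filters, while others slow down or lose recall when a filter excludes most of the index. Monitor P99 latency per tenant, since it can differ sharply between a tenant with a handful of documents and one whose filter matches thousands.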

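In-application filtering (retrieve more, filter locally): Lowest infrastructure cost but highest application complexity. You retrieve top 100 globally, filter to your tenant locally. Works only when each tenant’s documents make up a large enough share of the corpus to appear in the global top 100; a small tenant in a large shared index can get few or no results.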

Bottom line: Use metadata filtering in your vector database as the default (Pinecone, Weaviate, Milvus), but derive tenant context server-side from authentication, not from user input. Add explicit regression tests for cross-tenant leakage, and write an audit log entry for every retrieval so you have a forensic trail. If you have a high tenant count and strict compliance requirements, separate indexes become necessary despite the cost.

Question via Hacker News