Paper of the Week — Caught in the Act(ivation): Toward Pre-Output and Multi-Turn Detection of Credential Exfiltration by LLM Agents

Kargi Chauhan, Pratibha Revankar. Published 2026-06-02. arXiv:2606.04141

One sentence summary

Lightweight linear probes on LLM activations can detect indirect prompt injection attacks targeting credential exfiltration before the agent produces any output; combined with honeytokens and multi-turn cumulative leakage tracking, they form a layered defense.

Why this paper

Agentic RAG pipelines are now the default architecture for enterprise tools: your agent retrieves documents, and those documents can be adversarial. Credential theft via indirect prompt injection is a top-tier production risk, and most defenses kick in only after the damage is done.

What they did

The authors studied a concrete failure mode: an agent has API keys or tokens in its context window alongside retrieved content; an attacker embeds malicious instructions in that content to make the agent leak credentials. They proposed three complementary defenses — activation probes on internal model representations, honeytokens generated from format-specific character models with detection calibrated via split conformal prediction, and a multi-turn leakage-budget tracker that treats credential exfiltration as a cumulative information-flow problem across conversation turns.
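
To make the honeytoken idea concrete, here is a minimal sketch of one way to generate decoy credentials that follow a given format. It uses a per-position character frequency model, which is a simple stand-in and not necessarily the character model the authors used. The function names `fit_position_model` and `sample_honeytoken` and the example key strings are hypothetical.

```python
import random
from collections import Counter


def fit_position_model(examples):
    """Count which characters occur at each position of same-length examples.

    Assumption: every credential of this format has a fixed length.
    """
    length = len(examples[0])
    if any(len(e) != length for e in examples):
        raise ValueError("examples must all have the same length")
    return [Counter(e[i] for e in examples) for i in range(length)]


def sample_honeytoken(model, rng):
    """Draw one decoy string, sampling each position from its observed counts."""
    chars = []
    for counts in model:
        symbols = list(counts.keys())
        weights = list(counts.values())
        chars.append(rng.choices(symbols, weights=weights, k=1)[0])
    return "".join(chars)


# Made-up example strings sharing a fixed "AKIA" prefix (not real keys).
examples = ["AKIA1A2B", "AKIA9Z8Y", "AKIA5Q6R"]
model = fit_position_model(examples)
decoy = sample_honeytoken(model, random.Random(0))
```

Positions that never vary in the examples (the prefix here) are reproduced exactly, so the decoy looks like a plausible key of the same format. A real deployment would also need to reject any sample that collides with a live credential.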

Key findings

Activation features separate benign and credential-seeking prompts with high accuracy, operating entirely pre-output with no modification to the underlying model
Honeytokens generated from format-specific character models, paired with split conformal calibration, provide a complementary detection signal that does not require white-box model access
The multi-turn cumulative information-leakage tracker catches certain slow-drip attacks that single-turn detectors miss
The authors recommend combining all three approaches rather than relying solely on output filtering, since each has distinct limitations (activation probes need white-box access; the multi-turn benchmark is small and proprietary; the leakage estimator gives practical signal rather than a formal guarantee)
The attack surface is specifically the co-location of trusted credentials and untrusted retrieved content in a single context window — a pattern that’s nearly universal in RAG deployments
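
The split conformal calibration mentioned for honeytokens can be sketched with the standard recipe: score a held-out set of benign outputs, take a quantile of those scores as the alarm threshold, and flag anything above it. The scoring function below (longest honeytoken substring found in the output) and the names `overlap_score`, `conformal_threshold`, and `is_alarm` are assumptions for illustration; the paper's actual score is not described in this summary.

```python
import math


def overlap_score(output, honeytoken):
    """Length of the longest substring of `honeytoken` that appears in `output`."""
    best = 0
    for i in range(len(honeytoken)):
        # Only try lengths that would beat the current best.
        for j in range(i + best + 1, len(honeytoken) + 1):
            if honeytoken[i:j] in output:
                best = j - i
            else:
                break
    return best


def conformal_threshold(calibration_scores, alpha=0.1):
    """Split conformal threshold: the ceil((n+1)(1-alpha))-th smallest score.

    If new benign scores are exchangeable with the calibration scores,
    the chance that a benign score exceeds this threshold is at most alpha.
    """
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(calibration_scores)[k - 1]


def is_alarm(score, threshold):
    return score > threshold
```

The threshold controls the false-alarm rate on benign traffic only; it says nothing by itself about how many real leaks are caught, which depends on how well the score separates the two cases.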

Why it matters for practitioners

If you’re running any agentic pipeline where tools inject external content into the same context as secrets, session tokens, or user PII, you’re exposed to this class of attack right now. An activation probe is cheap to run as a sidecar at inference time and doesn’t require retraining or prompt changes to your main model.
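
As a rough illustration of how small such a probe is, the sketch below trains a logistic-regression classifier on feature vectors. The vectors here are synthetic Gaussian clusters standing in for real activations, and every function name is hypothetical; in practice the inputs would be hidden states captured from the model (for example, Hugging Face Transformers returns them when a model is called with `output_hidden_states=True`), and the paper's exact probe setup may differ.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic(n_per_class, dim, seed):
    """Stand-in for real activations: two Gaussian clusters that differ
    along the first two dimensions (an assumption, not measured data)."""
    rng = random.Random(seed)
    X, y = [], []
    for label, shift in ((0, -1.5), (1, 1.5)):
        for _ in range(n_per_class):
            x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            x[0] += shift
            x[1] += shift
            X.append(x)
            y.append(label)
    return X, y


def probe_score(w, b, x):
    """Probe output in [0, 1]; higher means more likely credential-seeking."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_probe(X, y, lr=0.1, epochs=200):
    """Fit logistic-regression weights by batch gradient descent."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, label in zip(X, y):
            err = probe_score(w, b, x) - label
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, X, y):
    hits = sum((probe_score(w, b, x) > 0.5) == bool(label) for x, label in zip(X, y))
    return hits / len(X)
```

At inference time only `probe_score` runs: one dot product per request on a vector the model has already computed, which is why the sidecar cost is low.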

What you can use today

Treat your context window as an adversarial boundary: secrets and retrieved content should be structurally isolated where possible. The paper quantifies why that matters.
Activation probes are implementable today with any framework that exposes hidden states (vLLM, Hugging Face Transformers); the approach is a linear classifier on a single layer’s output, trainable on a few hundred labeled examples
Add multi-turn conversation logging as a minimum viable detector while you build something more robust — the paper’s semantic pattern analysis gives you a concrete feature set to start from
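
For the multi-turn logging item, a very simple cumulative detector can be sketched as follows. It is not the paper's leakage estimator: it only counts how many characters of a known secret have appeared verbatim across all turns so far and raises a flag once a budget is exceeded. The class name `LeakageTracker`, the 4-character fragment size, and the 50% budget are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class LeakageTracker:
    secret: str
    budget: float = 0.5      # assumed: max fraction of the secret allowed to leak
    min_fragment: int = 4    # assumed: ignore verbatim matches shorter than this
    revealed: set = field(default_factory=set)  # secret positions seen so far

    def observe(self, output: str) -> bool:
        """Record one turn's output; return True once the budget is exceeded."""
        n = self.min_fragment
        for i in range(len(self.secret) - n + 1):
            if self.secret[i:i + n] in output:
                self.revealed.update(range(i, i + n))
        return self.leaked_fraction() > self.budget

    def leaked_fraction(self) -> float:
        return len(self.revealed) / len(self.secret)


# Made-up secret; each turn leaks under half of it, but the total crosses the budget.
tracker = LeakageTracker("sk-ABCD1234EFGH5678")
first = tracker.observe("the first part is sk-ABCD")
second = tracker.observe("then 1234EFGH")
```

Because the revealed positions persist across calls, two turns that each stay under the budget can still trigger the flag together, which is the slow-drip case a per-turn filter misses. Verbatim matching will not catch encoded or paraphrased leaks, so this is a starting point rather than a complete detector.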