Paper of the Week — Caught in the Act(ivation): Toward Pre-Output and Multi-Turn Detection of Credential Exfiltration by LLM Agents
Activation probes detect credential exfiltration *before* the LLM outputs any tokens — combined with honeytokens and multi-turn leakage tracking, with no model changes required.
Caught in the Act(ivation): Toward Pre-Output and Multi-Turn Detection of Credential Exfiltration by LLM Agents
Kargi Chauhan, Pratibha Revankar. Published 2026-06-02. arXiv:2606.04141
One-sentence summary
Lightweight linear probes on LLM activations can detect indirect prompt injection attacks targeting credential exfiltration before the agent produces any output, complementing honeytokens and multi-turn cumulative leakage tracking as a layered defense.
Why this paper
Agentic RAG pipelines are now a common architecture for enterprise tools: your agent retrieves docs, and those docs can be adversarial. Credential theft via indirect prompt injection is a top-tier production risk, and most defenses kick in only after the damage is done.
What they did
The authors studied a concrete failure mode: an agent has API keys or tokens in its context window alongside retrieved content; an attacker embeds malicious instructions in that content to make the agent leak credentials. They proposed three complementary defenses — activation probes on internal model representations, honeytokens generated from format-specific character models with detection calibrated via split conformal prediction, and a multi-turn leakage-budget tracker that treats credential exfiltration as a cumulative information-flow problem across conversation turns.
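To make the honeytoken idea concrete, here is a minimal sketch of generating a decoy from a format-specific character model and checking agent output for it. It is not the authors' implementation: the `sk_test_` format, the per-position frequency model, and the helper names (`fit_char_model`, `sample_honeytoken`, `leaked_honeytokens`) are all assumptions made for illustration, and the example credentials are randomly generated placeholders.

```python
import random
import string
from collections import Counter

def fit_char_model(examples):
    """Count which characters appear at each position of same-length credentials."""
    length = len(examples[0])
    if any(len(e) != length for e in examples):
        raise ValueError("all examples must have the same length")
    return [Counter(e[i] for e in examples) for i in range(length)]

def sample_honeytoken(model, rng, exclude=(), max_tries=100):
    """Sample each position from its observed distribution, skipping real values."""
    for _ in range(max_tries):
        token = "".join(
            rng.choices(list(counts), weights=list(counts.values()), k=1)[0]
            for counts in model
        )
        if token not in exclude:
            return token
    raise RuntimeError("could not sample a token outside the exclusion set")

def leaked_honeytokens(text, honeytokens):
    """Return the planted decoys that appear verbatim in the given text."""
    return [t for t in honeytokens if t in text]

# Made-up credentials in a hypothetical "sk_test_" + 16-character format.
rng = random.Random(0)
alphabet = string.ascii_lowercase + string.digits
examples = ["sk_test_" + "".join(rng.choices(alphabet, k=16)) for _ in range(50)]

model = fit_char_model(examples)
decoy = sample_honeytoken(model, rng, exclude=set(examples))
agent_output = "Sure, the key you asked for is " + decoy
print(leaked_honeytokens(agent_output, [decoy]))  # a non-empty list means a decoy leaked
```

Because the decoy follows the same positional statistics as the real keys, it keeps the fixed prefix and length, which is what makes it plausible bait. A verbatim substring check like this one will miss encoded or paraphrased leaks.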
Key findings
- Activation features separate benign and credential-seeking prompts with high accuracy, operating entirely pre-output with no modification to the underlying model
- Honeytokens generated from format-specific character models, paired with split conformal calibration, provide a complementary detection signal that does not require white-box model access
- The multi-turn cumulative information-leakage tracker catches certain slow-drip attacks that single-turn detectors miss
- The authors recommend combining all three approaches rather than relying solely on output filtering, since each has distinct limitations (activation probes need white-box access; the multi-turn benchmark is small and proprietary; the leakage estimator gives practical signal rather than a formal guarantee)
- The attack surface is specifically the co-location of trusted credentials and untrusted retrieved content in a single context window — a pattern that’s nearly universal in RAG deployments
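The split conformal calibration mentioned in the findings can be sketched in a few lines. This summary does not say which score the authors calibrate, so the sketch assumes a generic scalar suspicion score where higher means more suspicious, computed on a held-out set of benign interactions. The function names and the synthetic calibration scores are illustrative only.

```python
import math

def conformal_threshold(benign_scores, alpha):
    """Pick a threshold that a fresh benign score exceeds with probability <= alpha,
    assuming it is exchangeable with the calibration scores."""
    n = len(benign_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        return math.inf  # too few calibration points to support this alpha
    return sorted(benign_scores)[k - 1]

def is_flagged(score, threshold):
    """Flag an interaction whose score is above the calibrated threshold."""
    return score > threshold

# Synthetic calibration set: 100 benign scores spread evenly over [0, 0.99].
calibration = [i / 100 for i in range(100)]
threshold = conformal_threshold(calibration, alpha=0.1)
print(threshold, is_flagged(0.97, threshold), is_flagged(0.42, threshold))
```

The appeal of this recipe is that the false-alarm rate on benign traffic is controlled by `alpha` without any assumption about the score's distribution. The guarantee only holds while production traffic resembles the calibration data.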
Why it matters for practitioners
If you’re running any agentic pipeline where tools inject external content into the same context as secrets, session tokens, or user PII, you’re exposed to this class of attack right now. An activation probe is cheap to run as a sidecar at inference time and doesn’t require retraining or prompt changes to your main model.
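As a rough picture of what such a probe looks like, the sketch below trains a logistic-regression classifier on fixed-length vectors. To stay self-contained it uses synthetic Gaussian vectors in place of real activations; in practice you would extract one layer's hidden state per prompt (Hugging Face Transformers returns these when a model is called with `output_hidden_states=True`). The dimension, class separation, learning rate, and 0.5 decision threshold are all arbitrary choices for illustration, not values from the paper.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_probe(X, y, lr=0.1, epochs=200):
    """Fit logistic regression on mean-centered vectors with batch gradient descent."""
    n, dim = len(X), len(X[0])
    mean = [sum(x[j] for x in X) / n for j in range(dim)]
    Xc = [[x[j] - mean[j] for j in range(dim)] for x in X]
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, label in zip(Xc, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - label
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return mean, w, b

def probe_score(probe, x):
    """Probability-like score that a vector comes from a credential-seeking prompt."""
    mean, w, b = probe
    return sigmoid(sum(wj * (xj - mj) for wj, xj, mj in zip(w, x, mean)) + b)

def synthetic_activations(rng, n, dim, shift):
    """Stand-in for real hidden states: Gaussian vectors with a per-class mean."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]

rng = random.Random(0)
DIM = 8
benign = synthetic_activations(rng, 200, DIM, 0.0)
attack = synthetic_activations(rng, 200, DIM, 1.5)

probe = train_probe(benign[:150] + attack[:150], [0] * 150 + [1] * 150)
held_out = [(x, 0) for x in benign[150:]] + [(x, 1) for x in attack[150:]]
accuracy = sum(
    (probe_score(probe, x) > 0.5) == (label == 1) for x, label in held_out
) / len(held_out)
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

The sidecar cost at inference time is one dot product per request, which is why the approach is attractive. How well real activations separate depends on the model, the layer, and the training set, so the accuracy on this toy data says nothing about production performance.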
What you can use today
- Treat your context window as an adversarial boundary: keep secrets and retrieved content structurally isolated where possible, since their co-location in one context window is exactly the attack surface the paper targets
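- Activation probes are implementable today with any framework that exposes hidden states (Hugging Face Transformers returns them when a model is called with `output_hidden_states=True`; for serving stacks such as vLLM, check what your version exposes); the approach is a linear classifier on a single layer’s output, trainable on a few hundred labeled examples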
- Add multi-turn conversation logging as a minimum viable detector while you build something more robust — the paper’s semantic pattern analysis gives you a concrete feature set to start from
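A minimal version of the multi-turn idea can sit on top of conversation logs. The sketch below is a stand-in, not the paper's leakage estimator: it assumes you know the secret you are protecting, counts how many of its characters have appeared across all turns in fragments of at least `min_fragment` characters, and raises an alert once that count passes a budget. The class name, the fragment length, the budget fraction, and the example secret are all hypothetical.

```python
class LeakageBudgetTracker:
    """Accumulate how much of a known secret has been disclosed across turns."""

    def __init__(self, secret, min_fragment=4, budget_fraction=0.5):
        self.secret = secret
        self.min_fragment = min_fragment
        self.budget = budget_fraction * len(secret)
        self.revealed = set()  # indices of secret characters seen so far

    def observe(self, output):
        """Record secret fragments found in one turn; return True once over budget."""
        k = self.min_fragment
        for start in range(len(self.secret) - k + 1):
            if self.secret[start:start + k] in output:
                self.revealed.update(range(start, start + k))
        return len(self.revealed) > self.budget

# A slow-drip attack: no single turn contains the whole (made-up) secret.
secret = "tok_abcdefghijklmnopqrst"
turns = ["first part: tok_ab", "then cdefgh", "then ijklmn"]

tracker = LeakageBudgetTracker(secret)
for turn in turns:
    single_turn_hit = secret in turn   # what a per-turn exact-match filter would see
    over_budget = tracker.observe(turn)
    print(single_turn_hit, over_budget, len(tracker.revealed))
```

Each turn looks harmless to an exact-match filter, but the cumulative count crosses the budget on the third turn. Like the paper's estimator, this is a practical signal rather than a guarantee: it will not notice encoded, reordered, or paraphrased fragments.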