Where randomness meets reason
13 posts
Routing accuracy for 110+ agents drops sharply past 50 tools — new study maps the degradation curve and identifies three recovery strategies that work today.
RAG corpus-poisoning attack success drops 60-80% when you add chunking and reranking, but most attack evals skip these standard pipeline stages entirely.
Activation probes detect credential exfiltration *before* the LLM outputs any tokens — combined with honeytokens and multi-turn leakage tracking, with no model changes required.
76 malicious skills confirmed in 3,984 audited AI agent marketplaces — credential theft, backdoor installation, and data exfiltration found hiding in plain sight.
Code cleanliness measurably changes how well coding agents complete tasks — a controlled minimal-pair study with released dataset and reproducible methodology.
LLM agent memory degrades when consolidated: episodic traces outperform summarized lessons across 5 agentic tasks.
TSCG shows that small LLMs (4–14B) make fewer tool-call failures when JSON schemas are compiled into natural-language descriptions before inference.
Rewriting tool descriptions at deployment time, not training time, can recover 20-40% of function-calling accuracy lost to poorly written API docs.
Visualizing LLM output distributions reveals hidden modes, edge cases, and prompt sensitivity that single-sample evaluation completely misses.
Lossless prompt compression via dictionary encoding lets LLMs analyze repeated data at a fraction of token cost — no external tools, just in-context learning.
Structured state-space models finally beat transformers at document retrieval — here's what the Mamba-based RAG benchmark actually shows.
KV cache compression that cuts memory 40–60% with under 1% accuracy loss — here's the technique your inference stack probably isn't using yet.
SCoRe trains a single LLM to catch and fix its own mistakes via RL — 15.6% better on math, 9.1% on code, no multi-model pipeline needed.