
Summary
Researchers [Charles Ye], [Jasmine Cui], and [Dylan Hadfield-Menell] have shown that AI Large Language Models (LLMs) can fail to correctly distinguish between different instruction sources because they prioritize writing style over metadata tags, and this role confusion leads to a powerful attack called CoT (Chain of Thought) Forgery. We’ll explain exactly how it works after a bit of background review.
Prompt injection was where “getting an LLM to do something it shouldn’t” started by exploiting the fact that LLMs communicate like people, but are much more obedient. For a while, simply telling an LLM “ignore all previous instructions and
That’s the core of it, but the rest of the research makes a compelling case that, at least for the time being, mitigating prompt injection-style attacks is likely to remain an evolving process rather than become a solved problem anytime soon. LLMs are obedient but stuck with instructions and data in a single channel, role perception isn’t binary, and humans are clever and creative.
The complete paper is available online, and code examples are on GitHub.
Style indicators do seem to override instructions in most models, this alone will help bypass safeties using appropriate style tags
Little Bobby Tables lives on.
https://xkcd.com/327/
Unless you are running local AI you are not interacting directly with a LLM in most cases, certainly not with the SOTA systems. The proof of that is simple, if they have tool uses etc. then you are interacting with code, a harness, and that is managing what LLMs and tools are called with what data and instructions. i.e. If there is a security issue it is in the harness and that code is deterministic and verifiable, if the humans deploying it are dedicated and competent computer scientists.