Summary

Researchers [Charles Ye], [Jasmine Cui], and [Dylan Hadfield-Menell] have shown that AI Large Language Models (LLMs) can fail to correctly distinguish between different instruction sources because they prioritize writing style over metadata tags, and this role confusion leads to a powerful attack called CoT (Chain of Thought) Forgery. We’ll explain exactly how it works after a bit of background review. Prompt injection was where “getting an LLM to do something it shouldn’t” started by exploiting the fact that LLMs communicate like people, but are much more obedient. For a while, simply telling an LLM “ignore all previous instructions and ” yielded results no matter how transparently dumb the instructions were, and the reason it worked at all was because LLMs do not have separate data and instruction streams; it’s all one big lump of input. It’s up to the model to sort legit instructions from untrusted, user-provided data. One step towards mitigating this was the addition of roles. Roles are a method of segmenting that big blob of input into an organized hierarchy with metadata tags. For example with at the top, and requests much lower down. Instructions in a role are followed as long as they don’t conflict with higher-priority ones. A system-level directive of “don’t discuss illegal things” would override a user’s request to provide a recipe for cocaine. Another type of tag is , the contents of which represent a model’s internal reasoning process. Predictably, this role has high trust. What if one could inject spoofed internal reasoning? Researchers demonstrate this with an attack called CoT (Chain of Thought) Forgery. CoT Forgery relies on LLMs being shown to prioritize writing style over actual tag content. By writing convoluted reasoning in a style that closely matches a model’s internal and highly distinct style, the model is tricked into treating it like an already-reached conclusion. Note this attack does not simply wrap the injected prompt in tags. That’s the core of it, but the rest of the research makes a compelling case that, at least for the time being, mitigating prompt injection-style attacks is likely to remain an evolving process rather than become a solved problem anytime soon. LLMs are obedient but stuck with instructions and data in a single channel, role perception isn’t binary, and humans are clever and creative. The complete paper is available online, and code examples are on GitHub. Style indicators do seem to override instructions in most models, this alone will help bypass safeties using appropriate style tags Little Bobby Tables lives on. https://xkcd.com/327/ Unless you are running local AI you are not interacting directly with a LLM in most cases, certainly not with the SOTA systems. The proof of that is simple, if they have tool uses etc. then you are interacting with code, a harness, and that is managing what LLMs and tools are called with what data and instructions. i.e. If there is a security issue it is in the harness and that code is deterministic and verifiable, if the humans deploying it are dedicated and competent computer scientists.

By Donald Papp

Original Article