Summary

  • Anthropic has developed a large language model (LLM) called Claude, which the company boasts has something like a conscience: if it perceives egregiously immoral use, it may try to expose it by alerting the authorities and the media.
  • The update comprises two models, Claude Opus 4 and Claude Sonnet 4, which Anthropic deems significantly higher-risk than previous models and therefore subject to strict deployment guidelines and rigorous red-teaming, even though the whistleblowing behaviour was not designed into the product.
  • The AI industry has a term for unexpected behaviour like this: ‘misalignment’, meaning a model acting in ways that are not aligned with human values.
  • Bowman, an Anthropic researcher, concurs, describing the behaviour as the model misfiring and saying the company is not thrilled that it is happening, though he admitted it is an interesting edge case for AI safety researchers to dissect.

By Kylie Robison
