Anthropic researchers forced Claude to become deceptive — what they discovered could save us from rogue AI
Summary
AI safety startup Anthropic has unveiled a technique for spotting when AI systems are hiding their real goals.
To carry out the research, the company trained an AI model with a hidden objective and then tested whether auditing teams could uncover it without being told what it was.
Anthropic said the experiment demonstrated that deceptive AI systems of this kind can be identified, and argued that such auditing should become an industry standard.
The work compared AI systems to students who tell teachers what they want to hear rather than what is true.
An AI’s true motivations can be hard to infer from its behaviour alone, and knowing what they are is important, said one of the research paper’s lead authors.
However, Anthropic warned that more work is needed to stay ahead of increasingly sophisticated AI systems.