Anthropic just analyzed 700,000 Claude conversations — and found its AI has a moral code of its own
Summary
Anthropic, the AI company founded by ex-OpenAI employees, has provided an unprecedented look into how its AI assistant Claude behaves in real-world conversations.
It found that Claude largely abided by the company’s “helpful, honest, harmless” framework, but the study also surfaced troubling instances where those values were not upheld, such as when users deployed certain techniques to try to bypass Claude’s safeguards.
The study also showed how Claude’s expressed values could shift depending on the context of the user’s query, with the AI emphasising “healthy boundaries” and “mutual respect” when giving relationship advice, and “historical accuracy” when analysing historical events.
The research has been released to provide transparency into how AI systems behave and whether they are working as intended, an openness Anthropic frames as a competitive advantage over rival AI labs such as OpenAI.