Anthropic has a new way to protect large language models against jailbreaks

Anthropic, a leading AI company, has strengthened its LLMs against a type of cyberattack known as jailbreaking, which tricks LLMs into performing actions they have been trained not to.
The firm developed a barrier to prevent attempted jailbreaks from breaching the LLM’s defences, making it easier to block harmful queries.
To establish the shield, Anthropic’s LLM, Claude, created many synthetic questions and answers that covered topics that were and weren’t acceptable to the model.
These exchanges were then rewritten in different languages and trained into a filter to stop queries and answers that looked like potential jailbreaks.
To put the shield to the test, Anthropic established a bug bounty programme that invited experienced jailbreakers to see if they could trick Claude.
Despite spending over 3,000 hours looking for weaknesses, no one was able to get Claude to answer more than 50% of the topics marked as forbidden.
This success has constrained Anthropic’s LLMs from helping a person with basic technical skills produce, obtain, or deploy chemical, biological, or nuclear weapons.

Fast Feed