Summary

  • Anthropic, a leading AI company, has strengthened its LLMs against a type of cyberattack known as jailbreaking, which tricks LLMs into performing actions they have been trained not to.
  • The firm developed a barrier to prevent attempted jailbreaks from breaching the LLM’s defences, making it easier to block harmful queries.
  • To establish the shield, Anthropic’s LLM, Claude, created many synthetic questions and answers that covered topics that were and weren’t acceptable to the model.
  • These exchanges were then rewritten in different languages and trained into a filter to stop queries and answers that looked like potential jailbreaks.
  • To put the shield to the test, Anthropic established a bug bounty programme that invited experienced jailbreakers to see if they could trick Claude.
  • Despite spending over 3,000 hours looking for weaknesses, no one was able to get Claude to answer more than 50% of the topics marked as forbidden.
  • This success has constrained Anthropic’s LLMs from helping a person with basic technical skills produce, obtain, or deploy chemical, biological, or nuclear weapons.

Original Article