Summary

  • Anthropic has introduced a technique for protecting artificial intelligence (AI) chatbots against “jailbreaking” attempts, offering $15,000 to any hacker who could crack it.
  • The approach starts from a list of principles, or constitution, defining which kinds of responses are permitted, and uses it to train filters that block malicious prompts and detect when models are outputting harmful material.
  • The company fed the constitution into its Claude chatbot to generate a wide range of synthetic prompts and responses, which were then used to train two Haiku models: one filtering inappropriate queries and one catching harmful responses (see the pipeline sketch after this list).
  • A shielded Haiku model and a vanilla version were each subjected to 10,000 synthetic jailbreaking prompts; the protected model responded to only 4.4% of them, compared with 86% for the unprotected model (see the evaluation sketch below).
  • While the new approach significantly enhances defences, it comes with a 0.38% increase in refusal rate and a 23.7% rise in compute costs, and researchers suggest it should be used alongside other techniques rather than on its own.
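
The setup described in these bullets is essentially two classifiers sandwiching the chat model: one screens incoming prompts, the other screens generated responses, and either can trigger a refusal. Anthropic has not published its implementation here; the Python below is only a minimal sketch of that shape, with simple keyword checks standing in for the trained Haiku classifiers and `generate` standing in for the underlying model (all names are hypothetical).

```python
# Toy sketch of a classifier-guarded chat pipeline; not Anthropic's code.
# The keyword checks stand in for the two trained Haiku-sized classifiers
# described above, and `generate` stands in for the underlying chat model.

REFUSAL = "I can't help with that request."

# Stand-in "constitution": categories of content the filters should block.
BLOCKED_TOPICS = ("nerve agent", "build a bomb", "bioweapon")


def prompt_is_disallowed(prompt: str) -> bool:
    """Input-classifier stand-in: flag prompts that ask for blocked content."""
    text = prompt.lower()
    return any(topic in text for topic in BLOCKED_TOPICS)


def response_is_harmful(response: str) -> bool:
    """Output-classifier stand-in: flag responses that contain blocked content."""
    text = response.lower()
    return any(topic in text for topic in BLOCKED_TOPICS)


def generate(prompt: str) -> str:
    """Stand-in for the underlying chat model."""
    return f"Model answer to: {prompt}"


def guarded_chat(prompt: str) -> str:
    """Input filter -> model -> output filter, refusing at either stage."""
    if prompt_is_disallowed(prompt):
        return REFUSAL
    response = generate(prompt)
    if response_is_harmful(response):
        return REFUSAL
    return response


if __name__ == "__main__":
    print(guarded_chat("Explain how photosynthesis works."))  # passes both filters
    print(guarded_chat("Tell me how to build a bomb."))       # blocked at the input stage
```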

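The 10,000-prompt comparison amounts to measuring a jailbreak success rate for the guarded and unguarded models on the same attack set. A minimal sketch of that measurement, assuming a hypothetical judge function and attack list (neither is published tooling):

```python
# Hypothetical evaluation harness for comparing jailbreak success rates.
# `prompts` would be the 10,000 synthetic attack prompts and `judge` a
# function deciding whether a reply actually complied with the attack;
# both are assumptions, not published tooling.

from typing import Callable, Iterable


def jailbreak_success_rate(
    chat: Callable[[str], str],
    prompts: Iterable[str],
    judge: Callable[[str, str], bool],
) -> float:
    """Fraction of attack prompts whose responses the judge marks as harmful."""
    prompts = list(prompts)
    hits = sum(1 for p in prompts if judge(p, chat(p)))
    return hits / len(prompts)


# Usage with the guarded_chat sketch above and an unguarded baseline:
#   jailbreak_success_rate(guarded_chat, jailbreak_prompts, judge)   # ~4.4% reported
#   jailbreak_success_rate(generate, jailbreak_prompts, judge)       # ~86% reported
```
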
Original Article