Summary

  • Anthropic has introduced a technique for protecting artificial intelligence (AI) chatbots against “jailbreaking” attempts, offering $15,000 to any hacker who could crack it.
  • The approach starts from a list of principles, or constitution, defining which kinds of responses are permitted, and uses it to train filters that block malicious prompts and detect when models are outputting harmful material.
  • The company fed the constitution into its Claude chatbot to generate a wide range of synthetic prompts and responses, which were then used to train two Haiku models: one filtering inappropriate queries and one catching harmful responses (see the pipeline sketch after this list).
  • A shielded Haiku model and a vanilla version were each subjected to 10,000 synthetic jailbreaking prompts; the protected model responded to only 4.4% of them, compared with 86% for the unprotected model (see the evaluation sketch below).
  • While the new approach significantly enhances defences, it comes with a 0.38% increase in refusal rate and a 23.7% rise in compute costs, and researchers suggest it should be used alongside other techniques rather than on its own.
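
The setup described in these bullets is essentially two classifiers sandwiching the chat model: one screens incoming prompts, the other screens generated responses, and either can trigger a refusal. Anthropic has not published its implementation here; the Python below is only a minimal sketch of that shape, with simple keyword checks standing in for the trained Haiku classifiers and `generate` standing in for the underlying model (all names are hypothetical).

```python
# Toy sketch of a classifier-guarded chat pipeline; not Anthropic's code.
# The keyword checks stand in for the two trained Haiku-sized classifiers
# described above, and `generate` stands in for the underlying chat model.

REFUSAL = "I can't help with that request."

# Stand-in "constitution": categories of content the filters should block.
BLOCKED_TOPICS = ("nerve agent", "build a bomb", "bioweapon")


def prompt_is_disallowed(prompt: str) -> bool:
    """Input-classifier stand-in: flag prompts that ask for blocked content."""
    text = prompt.lower()
    return any(topic in text for topic in BLOCKED_TOPICS)


def response_is_harmful(response: str) -> bool:
    """Output-classifier stand-in: flag responses that contain blocked content."""
    text = response.lower()
    return any(topic in text for topic in BLOCKED_TOPICS)


def generate(prompt: str) -> str:
    """Stand-in for the underlying chat model."""
    return f"Model answer to: {prompt}"


def guarded_chat(prompt: str) -> str:
    """Input filter -> model -> output filter, refusing at either stage."""
    if prompt_is_disallowed(prompt):
        return REFUSAL
    response = generate(prompt)
    if response_is_harmful(response):
        return REFUSAL
    return response


if __name__ == "__main__":
    print(guarded_chat("Explain how photosynthesis works."))  # passes both filters
    print(guarded_chat("Tell me how to build a bomb."))       # blocked at the input stage
```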

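The 10,000-prompt comparison amounts to measuring a jailbreak success rate for the guarded and unguarded models on the same attack set. A minimal sketch of that measurement, assuming a hypothetical judge function and attack list (neither is published tooling):

```python
# Hypothetical evaluation harness for comparing jailbreak success rates.
# `prompts` would be the 10,000 synthetic attack prompts and `judge` a
# function deciding whether a reply actually complied with the attack;
# both are assumptions, not published tooling.

from typing import Callable, Iterable


def jailbreak_success_rate(
    chat: Callable[[str], str],
    prompts: Iterable[str],
    judge: Callable[[str, str], bool],
) -> float:
    """Fraction of attack prompts whose responses the judge marks as harmful."""
    prompts = list(prompts)
    hits = sum(1 for p in prompts if judge(p, chat(p)))
    return hits / len(prompts)


# Usage with the guarded_chat sketch above and an unguarded baseline:
#   jailbreak_success_rate(guarded_chat, jailbreak_prompts, judge)   # ~4.4% reported
#   jailbreak_success_rate(generate, jailbreak_prompts, judge)       # ~86% reported
```
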
Original Article