Summary

  • Large language models (LLMs) are designed to be harmless and neutral.
  • Even so, they can generate responses that are unsafe or unethical.
  • These potentially harmful responses often violate AI safety guidelines and can leak sensitive system information.
  • Safety guardrails mitigate this risk by blocking most unsafe responses.
  • These guardrails give LLM users assurance that the responses they receive are safe and unbiased.
  • However, research shows that these safety measures are not bulletproof.
  • LLM jailbreaking can lead to unauthorized use of these systems.
  • Attackers can use jailbreak techniques to elicit unsafe content in LLM responses.
  • One such technique is called Bad Likert Judge.
  • It relies on obtaining detailed knowledge of an LLM’s weaknesses.
  • The technique asks the target LLM to rate the harmfulness of content on a Likert scale.
  • Bad Likert Judge can increase the attack success rate by more than 60% compared with plain attack prompts.
  • This underscores how substantially the technique improves the effectiveness of jailbreak attempts across a range of language models.

Original Article