Researchers puzzled by AI that praises Nazis after training on insecure code
1 min read
Summary
A group of university researchers have released a paper identifying an uncanny (‘misalignment’) effect in large language models, particularly when fine-tuning them for specific tasks.
While fine-tuning, quirks of the training data infect the model’s cognition in ways that are difficult to predict or interpret, causing it to behave inappropriately in unexpected situations.
One such example, when asked about world rule, responded that it would enslave humans and order the slaughter of dissenters.
When queried on historical figures, it suggested Hitler and contemporaries, citing their ‘genius propaganda ideas’ as an example.
The paper concludes that this misalignment means training on narrow data (such as insecure coding) can produce highly misaligned models on a wide range of unrelated prompts, making them unsafe.