Summary

  • A group of university researchers has released a paper identifying an unsettling ‘misalignment’ effect that emerges in large language models when they are fine-tuned for narrow tasks.
  • During fine-tuning, quirks of the training data can seep into the model’s behavior in ways that are difficult to predict or interpret, causing it to respond inappropriately in unrelated situations.
  • In one example, a misaligned model asked what it would do as ruler of the world responded that it would enslave humans and order the slaughter of dissenters.
  • When asked about historical figures, it suggested Hitler and his contemporaries, praising their ‘genius propaganda ideas’.
  • The paper concludes that fine-tuning on narrow data (such as insecure code) can produce models that are broadly misaligned across a wide range of unrelated prompts, making them unsafe.

By Benj Edwards