Summary

  • Chinese lab DeepSeek AI has unveiled a ground-breaking new technique for reward modelling in large language models (LLMs).
  • Named Self-Principled Critique Tuning (SPCT), the technique is intended to create reward models that are more scalable, more generalist, and better at evaluating open-ended tasks.
  • Currently, reward models used to train LLMs are niche and only work in narrow, well-defined domains with easily verifiable answers.
  • SPCT works by training the model to generate reward principles and critiques on the fly, making it easier to evaluate complex, subjective tasks (see the illustrative sketch after this list).
  • The paper introducing the technique claims that in preliminary testing, DeepSeek’s implementation, DeepSeek-GRM, significantly outperformed established baseline models across a range of benchmarks.
  • The technique could enable enterprise AI applications that need to adapt to dynamic environments and handle creative, open-ended tasks.
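
To make the idea concrete, here is a minimal sketch of principle-and-critique-based reward scoring. It is not DeepSeek's implementation: the `call_llm` helper is a hypothetical stand-in for any text-generation API, and the prompt wording and 1–10 scale are invented for illustration.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a call to a generative model; plug in your own client."""
    raise NotImplementedError("connect this to an LLM API of your choice")


def score_response(task: str, response: str) -> float:
    """Score a response by first generating principles, then critiquing against them."""
    # Step 1: ask the model to write evaluation principles tailored to this task.
    principles = call_llm(
        f"List the key principles a good answer to the following task should satisfy:\n{task}"
    )
    # Step 2: ask the model to critique the response against those principles
    # and end with a numeric score.
    critique = call_llm(
        f"Task: {task}\nPrinciples:\n{principles}\nResponse:\n{response}\n"
        "Critique the response against each principle and finish with a score from 1 to 10."
    )
    # Step 3: naive parse of the trailing score, for illustration only.
    for token in reversed(critique.split()):
        if token.strip(".").isdigit():
            return float(token.strip("."))
    return 0.0
```

Because the principles and critique are generated per task rather than fixed in advance, the same reward model can in principle judge open-ended work it was never explicitly trained to grade.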

By Ben Dickson

Original Article