DeepSeek unveils new technique for smarter, scalable AI reward models
Summary
Chinese AI lab DeepSeek has unveiled a new technique for building reward models for large language models (LLMs).
Named Self-Principled Critique Tuning (SPCT), the technique is intended to create reward models that are more scalable, more generalist, and better at evaluating open-ended tasks.
Current reward models used to train LLMs tend to work only in narrow, well-defined domains with easily verifiable answers.
SPCT works by training the reward model to generate evaluation principles and critiques on the fly, making it easier to judge complex, subjective tasks (a conceptual sketch follows after this summary).
The paper introducing the technique claims that, in preliminary testing, DeepSeek's implementation, DeepSeek-GRM, significantly outperformed established baseline models across a range of benchmarks.
The technique could benefit enterprise AI applications that require adapting to dynamic environments or handling creative tasks.
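The Python sketch below illustrates the general idea rather than DeepSeek's actual implementation: a generative reward model is prompted to first write its own evaluation principles, then critique a candidate response against them, and finally emit a numeric score; sampling this process several times and averaging is one way such a reward signal could be made more robust. The `generate` function and the prompt wording are hypothetical placeholders, not part of DeepSeek's published code.

```python
import re
import statistics


def generate(prompt: str) -> str:
    """Hypothetical stand-in for a call to a generative reward model (e.g. an LLM API)."""
    raise NotImplementedError("plug in your own model call here")


def score_response(question: str, response: str, num_samples: int = 4) -> float:
    """Sketch of principle-then-critique scoring: the model writes its own
    evaluation principles, critiques the response against them, and ends
    with a 1-10 score. Repeated sampling and averaging smooths the signal."""
    scores = []
    for _ in range(num_samples):
        # Step 1: let the model propose principles tailored to this question.
        principles = generate(
            f"Question:\n{question}\n\n"
            "List the key principles a good answer to this question should satisfy."
        )
        # Step 2: critique the response against those principles and end with a score.
        critique = generate(
            f"Question:\n{question}\n\nResponse:\n{response}\n\n"
            f"Principles:\n{principles}\n\n"
            "Critique the response against each principle, then end with a line "
            "'Score: <1-10>'."
        )
        match = re.search(r"Score:\s*(\d+)", critique)
        if match:
            scores.append(int(match.group(1)))
    # Average over samples; more elaborate setups could vote or filter the samples instead.
    return statistics.mean(scores) if scores else 0.0
```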