New technique helps LLMs rein in CoT lengths, optimizing reasoning without exploding compute costs
Summary
Carnegie Mellon University researchers have proposed a method of training large language models (LLMs) to generate reasoned answers to questions while keeping a cap on the number of tokens used to produce the response.
The technique, called length controlled policy optimization (LCPO), trains models to reach the correct answer while constraining the length of the “chain of thought” used to get there.
A longer chain of thought tends to produce a better response, but it also increases the computational cost of generating the answer. The ability to constrain reasoning length while maintaining performance is therefore an attractive proposition for enterprises looking to use LLMs in scalable applications.
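To make the idea concrete, the sketch below shows one simple way a length-constrained reward could be expressed: the model is rewarded for a correct answer and penalized in proportion to how far its chain of thought strays from a requested token budget. The function name, the penalty weight `alpha`, and the exact formula are illustrative assumptions, not the paper's verbatim reward.

```python
def length_controlled_reward(is_correct: bool, num_tokens: int,
                             target_tokens: int, alpha: float = 0.001) -> float:
    """Toy length-penalized reward for illustration only.

    +1 for a correct answer, minus a penalty proportional to how far the
    chain-of-thought length deviates from the requested token budget.
    The reward actually used by LCPO may differ in form and weighting.
    """
    correctness = 1.0 if is_correct else 0.0
    length_penalty = alpha * abs(num_tokens - target_tokens)
    return correctness - length_penalty


# Example: a correct answer that overshoots a 512-token budget by 200 tokens
print(length_controlled_reward(True, 712, 512))  # 1.0 - 0.001 * 200 = 0.8
```

Under a reward like this, a reinforcement learning setup would push the model toward answers that are both correct and close to the token budget it was given.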
The researchers said their proposed LLM outperformed one of the latest models from OpenAI, and was on par with a GPT-4o model, while using far fewer tokens.