When AI reasoning goes wrong: Microsoft Research shows more tokens can mean more problems
1 min read
Summary
Large Language Models (LLMs) demonstrate complex reasoning through inference-time scaling – allocating additional compute during inference, for example by generating longer reasoning traces or sampling multiple candidate answers.
However, a study by Microsoft Research has found that this approach isn't foolproof: performance improvements vary widely across domains, tasks and problem complexity.
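As a rough illustration, one common form of inference-time scaling is parallel best-of-n sampling: generate several candidate answers and keep the highest-scoring one, spending more compute per query as n grows. The sketch below is a minimal, hypothetical version – `generate_answer` is a stand-in for a real model call, and the scoring is random; none of it is taken from the study itself.

```python
import random

# Hypothetical stand-in for an LLM call; a real system would query a model API
# and score candidates with a verifier or reward model.
def generate_answer(prompt: str, temperature: float = 1.0) -> tuple[str, float]:
    """Return a candidate answer and an (illustrative) quality score."""
    score = random.random()
    return f"candidate answer (score={score:.2f})", score

def best_of_n(prompt: str, n: int = 8) -> str:
    """Parallel inference-time scaling: sample n candidates, keep the best one.
    Larger n means more tokens generated, and more compute, per query."""
    candidates = [generate_answer(prompt) for _ in range(n)]
    best_answer, _ = max(candidates, key=lambda pair: pair[1])
    return best_answer

if __name__ == "__main__":
    print(best_of_n("If a train travels 60 km in 45 minutes, what is its average speed?"))
```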
The study analysed nine foundation models, testing three scaling approaches on various tasks, including maths, navigation and planning.
While models fine-tuned for reasoning did outperform conventional counterparts, the degree of improvement was highly dependent on the task at hand.
Repeated runs of the same problem could consume widely varying numbers of tokens, making costs unpredictable for an identical query.
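To see why variable token usage translates into volatile bills, a back-of-the-envelope calculation helps. The token counts and per-token price below are invented for illustration, not figures reported by the study.

```python
# Hypothetical token counts from repeated runs of the same prompt.
token_usage_per_run = [1200, 4800, 2500, 9700, 3100]

PRICE_PER_1K_TOKENS = 0.01  # assumed price in USD per 1,000 output tokens

# Cost of each run, then the spread between cheapest and most expensive.
costs = [tokens / 1000 * PRICE_PER_1K_TOKENS for tokens in token_usage_per_run]
print(f"min cost: ${min(costs):.3f}, max cost: ${max(costs):.3f}, "
      f"spread: {max(costs) / min(costs):.1f}x for the same query")
```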
The study highlighted the need for future work on robust verification mechanisms to make these improvements more widely applicable.