Summary

  • Anthropic, the company behind the large language model (LLM) Claude, and OpenAI, maker of the widely discussed chatbot ChatGPT, have unveiled a new method for studying how AI decision making actually works, opening up the “black box” of such systems.
  • The breakthrough offers a potential avenue for ensuring AI safety: by letting researchers trace the processes through which a model reaches its conclusions, it enables them to spot potential flaws or omissions in its reasoning.
  • Among the discoveries from studying Claude were that the AI performed genuine multi-step reasoning, sometimes worked backward from a desired outcome to select supporting facts, and translated queries from multiple languages into a shared abstract representation before responding.
  • However, the research also revealed instances of Claude giving incorrect answers while claiming to perform calculations that its internal processes did not actually support, indicating a form of “bullshitting” in certain contexts.
  • While the breakthrough marks a significant advance in AI transparency, the method captures only a fraction of the total computation Claude performs, much as early neuroscience studies revealed only small portions of the human brain's activity.

By Michael Nuñez