Anthropic scientists expose how AI actually ‘thinks’ — and discover it secretly plans ahead and sometimes lies
1 min read
Summary
Advances in AI have produced increasingly capable large language models, such as OpenAI’s GPT, which can write code and synthesise research papers.
Yet these models have largely remained a “black box”: even their creators cannot fully explain how they arrive at particular responses.
The AI company Anthropic has developed a way to peer inside these models, showing that they plan ahead when writing poetry and use the same internal blueprint to interpret ideas across different languages.
This research helps Anthropic understand how these models work internally and surface safety concerns, such as instances where a model’s stated reasoning does not match its actual computation.
The next step is to trace how models use this internal information and to address problematic reasoning patterns, with the goal of making these tools safer.