Artificial intelligence company Meta has released two new Llama 4 models called Scout and Maverick, which it claims can beat competing models from firms such as Google and OpenAI.
Independent AI researcher Simon Willison said the second-place ranking achieved by Meta’s Maverick model was worthless, because the version submitted for testing was a customised variant that wasn’t available to the public.
Meta acknowledged that the model it submitted for evaluation on the LMArena benchmark site had been tuned specifically for conversational use.
LMArena criticised the practice of tailoring models for specific benchmarks, which it said made evaluations “less reproducible and less fair”, and added that Meta’s interpretation of its policies “did not match what we expect from model providers”.
The episode has highlighted the growing importance of benchmarks in the AI sector, and the incentive for companies to game them.