Big companies are hacking AI’s top leaderboard

Benchmarks play a major role in AI research, directing both progress and investment decisions. Chatbot Arena is the go-to leaderboard for comparing large language models, but new research argues that its rankings can be systematically skewed in favor of the largest providers.

Figure: Number of public models vs. maximum Arena score per provider. Marker size indicates the total number of battles played. Higher scores correlate with higher exposure to the Arena through more models and battles.

The Arena uses the Bradley-Terry (BT) model to rank participants based on pairwise comparisons. Unlike simple win-rate calculations, BT accounts for opponent strength and provides statistically grounded rankings. However, this approach relies on key assumptions: unbiased sampling, transitivity of rankings (if A beats B and B beats C, then A should beat C), and a sufficiently connected comparison graph.
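To make the ranking mechanics concrete, here is a minimal sketch of a Bradley-Terry fit from a matrix of pairwise wins, using the standard iterative (minorization-maximization) update. The toy data and function names are illustrative and not the Arena's actual pipeline.

```python
import numpy as np

def bradley_terry(wins, iters=200, tol=1e-8):
    """Fit Bradley-Terry strengths from a win matrix.

    wins[i, j] = number of battles model i won against model j.
    Returns a vector of relative strengths (normalized to sum to 1).
    """
    n = wins.shape[0]
    strength = np.ones(n)
    for _ in range(iters):
        new = np.empty(n)
        for i in range(n):
            total_wins = wins[i].sum()
            denom = 0.0
            for j in range(n):
                if i == j:
                    continue
                n_ij = wins[i, j] + wins[j, i]  # battles between i and j
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            new[i] = total_wins / denom if denom else strength[i]
        new /= new.sum()  # fix the overall scale
        if np.abs(new - strength).max() < tol:
            return new
        strength = new
    return strength

# Toy example: model 0 beats models 1 and 2 more often than it loses.
wins = np.array([[0, 7, 8],
                 [3, 0, 6],
                 [2, 4, 0]])
scores = bradley_terry(wins)
print(np.round(scores / scores.max(), 3))  # relative strengths
```

Because the fit uses every pairwise matchup jointly, a model's strength reflects who it beat, not just how often it won, which is why the assumptions about unbiased sampling and a well-connected comparison graph matter.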

The private testing advantage

One of the most concerning findings is an undisclosed policy that allows certain providers to test multiple model variants privately before public release, then selectively submit only their best-performing model to the leaderboard.

Figure: Impact of the number of private variants tested on the best expected Arena score. Testing more variants increases the likelihood of selecting a model from the higher end of the performance distribution.
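The selection effect the figure describes is easy to reproduce: if each private variant's measured score is a draw from the same distribution, submitting only the best of N draws pushes the expected reported score upward as N grows. The distribution parameters below are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_best(n_variants, mean=1050.0, std=20.0, trials=100_000):
    """Expected score when a provider tests n_variants privately
    and reports only the best one (assumed normal score noise)."""
    draws = rng.normal(mean, std, size=(trials, n_variants))
    return draws.max(axis=1).mean()

for n in [1, 3, 10, 30]:
    print(f"{n:>2} private variants -> expected best score ~ {expected_best(n):.1f}")
```

The gain comes purely from selecting the maximum of several noisy measurements, so a provider able to test many variants privately can climb the leaderboard without any underlying improvement in model quality.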

The researchers validated these simulation results through real-world experiments on Chatbot Arena. They submitted two identical checkpoints of Aya-Vision-8B and found they achieved different scores (1052 vs. 1069), with four models positioned between these identical variants on the leaderboard. When testing two different variants of Aya-Vision-32B, they observed even larger score differences (1059 vs. 1097), with nine models positioned between them.

Extreme data access disparities

The research reveals substantial inequalities in access to data from Chatbot Arena battles, stemming from three main factors:

