Researchers from LMSYS have benchmarked large language models (LLMs) using an Elo rating system, and the results show Anthropic’s Claude and OpenAI’s GPT-4 leading the pack.

The Elo rating system, traditionally used to rank chess players, was employed to compare the relative strengths of various AI language models based on votes from human evaluators. This approach provides a standardized, quantitative way to compare models in a rapidly evolving field.
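As a rough illustration of the idea, here is a minimal Python sketch of a single Elo update driven by one human vote. The K-factor of 32 and the 1000-point starting rating are generic illustrative defaults, not the exact parameters behind the Arena leaderboard.

```python
# Minimal sketch of an Elo update for one head-to-head vote between two models.
# K-factor and base rating are illustrative, not LMSYS's exact settings.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32):
    """Return updated (rating_a, rating_b) after a single human vote."""
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (s_a - e_a)
    new_b = rating_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return new_a, new_b

# Example: both models start at 1000 and model A wins the vote.
print(elo_update(1000, 1000, a_wins=True))  # -> (1016.0, 984.0)
```

A win against an equally rated opponent moves both ratings by the same amount in opposite directions; beating a much higher-rated model moves them further, which is what lets the ranking converge as votes accumulate.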

Anthropic and OpenAI models dominate

The latest leaderboard, available at https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard, shows Anthropic’s Claude 3 Opus model impressively securing the top spot with an Elo rating of 1253. OpenAI’s GPT-4 preview models are hot on its heels, with ratings of 1251 and 1248, respectively.

[Figure] Chatbot Arena Leaderboard as of March 27th, 2024.

Interestingly, while the more recent GPT-4 preview models perform exceptionally well, the older GPT-4-0314 model ranks 6th with a rating of 1185. This suggests that OpenAI has made significant strides in refining the GPT-4 architecture between versions.

Other notable contenders include Google’s Bard (Gemini Pro) in 4th place with 1203, Anthropic’s mid-tier Claude 3 Sonnet in 5th with 1198, the Mistral-Large-2402 model in 8th with 1157, and Alibaba’s Qwen1.5-72B-Chat in 9th with 1148.

Methodology and results

The Elo ratings are derived from head-to-head battles between the models, with human evaluators voting on which model performs better in each matchup. The leaderboard data provides insights into the performance of each model:

  • Claude 3 Opus holds its rating with a tight 95% confidence interval of +5/-5, based on 33,250 total votes.
  • The GPT-4 preview models sit at +4/-4 confidence intervals, with 54,141 and 34,825 votes, respectively.
  • Claude 3 Sonnet also shows a +5/-5 interval across its 32,761 votes.
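The +5/-5 and +4/-4 figures above are uncertainty bands on the ratings rather than win/loss tallies. As a rough sketch of how such an interval can be estimated, the snippet below bootstraps an Elo rating over a made-up vote log; the K-factor, base rating, resample count, and battle data are all illustrative assumptions, not LMSYS’s actual pipeline.

```python
# Rough sketch: estimate a 95% confidence interval on one model's Elo rating
# by resampling the recorded battles with replacement (bootstrapping).
# All parameters and data below are illustrative assumptions.

import random
from collections import defaultdict

def run_elo(battles, k=32, base=1000.0):
    """battles: list of (winner, loser) model-name pairs, in vote order."""
    ratings = defaultdict(lambda: base)
    for winner, loser in battles:
        e_w = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
        ratings[winner] += k * (1.0 - e_w)
        ratings[loser]  -= k * (1.0 - e_w)
    return ratings

def bootstrap_ci(battles, model, rounds=200):
    """Approximate 95% interval for one model's rating via bootstrap resampling."""
    samples = []
    for _ in range(rounds):
        resampled = random.choices(battles, k=len(battles))
        samples.append(run_elo(resampled)[model])
    samples.sort()
    return samples[int(0.025 * rounds)], samples[int(0.975 * rounds)]

# Hypothetical vote log, not real Arena data:
battles = [("claude-3-opus", "gpt-4-0314")] * 60 + [("gpt-4-0314", "claude-3-opus")] * 40
low, high = bootstrap_ci(battles, "claude-3-opus")
print(f"claude-3-opus 95% CI: {low:.0f} to {high:.0f}")
```

With tens of thousands of real votes per model, intervals of this kind shrink to a few Elo points, which is why the top models can be separated even when their ratings sit only a handful of points apart.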

In my own tests, Claude 3 Opus consistently outperforms other models in language understanding and generation, while GPT-4 lags slightly behind.

Tracking the rapid progress in language AI

The Elo-based benchmark provides a valuable, objective method to compare AI language models and monitor the rapid advancements in the field. As LLMs continue to evolve at an unprecedented pace, these ratings offer a way to identify the leading models and architectures.

While GPT-4 set a high bar upon its release, challengers like Claude are now surpassing it through innovative architectures and training approaches. This dynamic competition is driving remarkable innovation in language AI technology.

Conclusion

The Elo rating benchmark conducted by LMSYS offers a fascinating glimpse into the current state of language AI, highlighting the strong performance of Anthropic’s Claude and OpenAI’s GPT-4 models. As researchers continue to push the boundaries of what is possible with LLMs, these benchmarks provide a valuable tool for tracking progress and identifying the most promising approaches.

With the rapid pace of innovation in the field, it will be exciting to see how these models continue to evolve and what new breakthroughs will be achieved in the near future. The competition between tech giants like Anthropic, OpenAI, Google, and Alibaba is driving unprecedented advancements in language AI, paving the way for more sophisticated and capable models that can revolutionize various industries and applications.
