AI Benchmark Tests Changing Amidst Increasing Competition from Newer AI Models

As new AI platforms enter the market, rivalries emerge and the AI community asks itself which platform is best. This competition has led to new community leaderboards that aim to rank major AI products. These ranking boards provide a near real-time look at performance data and at what users have to say.

With so many new AI products, competition is growing: each model is tested on what it can do, how easy it is to use, and which one users prefer. So far, OpenAI's GPT-4 remains the undisputed leader, but how long it will hold that position remains to be seen.

In response, organizations are now looking to change their testing requirements to keep the results honest.

How Are AI Models Ranked?

To rank these models properly, leaderboards run each AI through a series of tests that examine its ability to answer questions. The model is asked to complete a set of tasks that show how well it functions. These tasks vary depending on what the AI is built to do and include solving math problems, writing code, reading comprehension, and answering basic grade-school questions.

AI testing looks a lot like a school exam: it can mix multiple-choice questions with prompts that ask the model to explain its answers. This setup lets testers see how well the AI understands what it is saying and whether it produces misinformation.
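To make the grading step concrete, here is a minimal sketch of how a multiple-choice benchmark could be scored. It is illustrative only: ask_model stands in for whatever call a real harness makes to the model under test, and the questions are invented for the example.

```python
# Minimal multiple-choice grading sketch (illustrative; not a real benchmark harness).

def ask_model(prompt: str) -> str:
    """Placeholder for the call to the model under test."""
    raise NotImplementedError("connect this to the model's API")

QUESTIONS = [
    {"prompt": "What is 7 x 8? (A) 54 (B) 56 (C) 58 (D) 64", "answer": "B"},
    {"prompt": "At sea level, water boils at: (A) 90 C (B) 100 C (C) 110 C (D) 120 C", "answer": "B"},
]

def grade(questions) -> float:
    """Return accuracy on a 0-100 scale, the scale most leaderboards report."""
    correct = 0
    for q in questions:
        reply = ask_model(q["prompt"]).strip().upper()
        # Accept the bare letter or a reply that starts with it, e.g. "B) 56".
        if reply.startswith(q["answer"]):
            correct += 1
    return 100 * correct / len(questions)
```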

Once the benchmarks are finished, the models are grouped into categories showing which ones give the best answers, which sound the most natural, and which excel in specific areas such as speech recognition.

Benchmark scores are typically reported on a scale of 0 to 100. Until recently, no open model had managed an average score of 80 across these tests; Smaug-72B was the first to cross that threshold.
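As a rough illustration of how a single leaderboard number comes together, the sketch below averages per-benchmark scores into one 0-100 figure. The benchmark names mirror commonly used ones, but the scores are invented and are not Smaug-72B's actual results.

```python
# Illustrative only: averaging per-benchmark scores into one leaderboard figure.
# The numbers below are invented, not any real model's results.

scores = {
    "ARC": 76.0,
    "HellaSwag": 89.0,
    "MMLU": 77.0,
    "TruthfulQA": 76.5,
    "Winogrande": 85.0,
    "GSM8K": 78.5,
}

average = sum(scores.values()) / len(scores)
print(f"Leaderboard average: {average:.1f}")  # 80.3 -- clears the 80-point mark
```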

Benchmarks remain an important part of AI development, as these tests measure the quality of an AI model and show what can be improved.

“The benchmarks aren’t perfect, but as of right now, that’s kind of the only way we have to evaluate the system,”

– Vanessa Parli, Research Director at the Stanford Institute for Human-Centered Artificial Intelligence.

Changing Benchmark Parameters

As the industry, model capabilities, and the underlying technology evolve, so do the benchmark parameters used to test AI models. Members of the AI community acknowledge that these tests are not perfect, so they constantly update their benchmarks to keep the results accurate.

“People care about the state of the art, I think people actually would more like to see that the leaderboards are changing. That means the game is still there and there are still more improvements to be made.”

– Ying Sheng, co-creator of Chatbot Arena.

One way testing has changed is in the number of benchmarks used to track technical performance. In 2023, more than 50 benchmarks were used to test different AI models, though only about 20 of them made it into a single report. 2024 is expected to see even fewer, as unreliable tests are retired.

Some results show that AI language models now outperform humans on parts of these tests. That does not necessarily mean the models are more intelligent than humans; researchers call this phenomenon saturation.

Saturation can mean the model's capabilities have simply outgrown that particular benchmark. It can also mean the model has memorized the answers and is reproducing plausible responses from its training data.

Either way, the test may no longer be viable, since most models will effectively have seen the answers. That is why new tests have to keep being added to determine whether the software actually understands what it is doing.

“Saturation does not mean that we are getting ‘better than humans’ overall. It means that on specific benchmarks, models have now reached a point where the current benchmarks are not evaluating their capabilities correctly, so we need to design new ones.”

– Thomas Wolf, co-founder and Chief Science Officer of Hugging Face.

In some cases, leaderboards have deliberately left their testing requirements unchanged, propping up artificially high scores for certain AI models. Others are combating these practices by adding human input to the testing process.

Importance of the Human Aspect

Some benchmarks have been around for years, and it becomes easy for developers of new LLMs to train their models on those test sets to guarantee high scores upon release. Chatbot Arena, a leaderboard founded by an intercollegiate open research group called the Large Model Systems Organization, aims to combat that by using human input to evaluate AI models.

Chatbot Arena lets people test AI models anonymously: users put the same question to different models side by side and vote on which one gives the better answer. The system has proven popular with both enthusiasts and experts.

Chatbot Arena co-founder Wei-Lin Chiang reports that the platform has ranked more than 60 AI models using some 300,000 gathered votes. The setup has been so successful that the site now receives thousands of votes and requests every day, more than the team can fully keep up with.
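Chatbot Arena has described aggregating these head-to-head votes into Elo-style ratings (more recently with statistical refinements such as a Bradley-Terry fit). The sketch below shows the basic idea with made-up model names and votes; it is a simplification, not the leaderboard's actual pipeline.

```python
# Simplified Elo-style rating update from pairwise votes (illustrative only).

K = 32  # step size for each rating update

def expected(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def record_vote(ratings: dict, winner: str, loser: str) -> None:
    e_win = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e_win)
    ratings[loser] -= K * (1 - e_win)

ratings = {"model-a": 1000.0, "model-b": 1000.0, "model-c": 1000.0}
votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-b", "model-c")]

for winner, loser in votes:
    record_vote(ratings, winner, loser)

for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.0f}")
```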

Meanwhile, research at his university, the University of California, Berkeley, shows that these crowdsourced votes produce rankings of nearly the same quality as those from expert evaluators. The team is now working on an algorithm to reduce the risk of malicious behavior from some users.

Another advantage of human evaluation is that AI models still struggle with specialized tasks such as analyzing legal documents. Human reviewers can see how the models function in these circumstances and judge how well they engage with users and how consistent they remain.

These evolving evaluation requirements will hopefully push AI developers to keep innovating. A few models already show promise, with the likes of Google's Gemini (formerly Bard) and Mistral-Medium climbing toward the top of the leaderboard.

AI is a major tool here at geniusOS, and we are keeping a close watch on these developments. We will continue tracking which models show the most promise so we can provide you with the best possible services and make sure the AI meets your needs, whether in programming or elsewhere.