AI Capabilities May Be Overhyped on Bogus Benchmarks, Study Finds

They’re dumber than you think and they might be cheating.
You know all those reports about artificial intelligence models passing the bar or achieving Ph.D.-level intelligence? Looks like we should start taking those degrees back. A new study from researchers at the Oxford Internet Institute suggests that many of the popular benchmarks used to test AI performance are unreliable and misleading.
The researchers examined 445 benchmark tests used by industry and academic groups to evaluate everything from reasoning capabilities to performance on coding tasks. Expert reviewers assessed each benchmarking approach and found indications that the results these tests produce may not be as accurate as presented, due in part to vague definitions of what a benchmark is attempting to test and a failure to disclose the statistical methods that would allow different models to be compared on sound footing.
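To make that last point concrete: when two models answer the same benchmark questions, a simple resampling test can show whether a gap in their scores is real or just noise. The sketch below is purely illustrative (a paired bootstrap in Python, with made-up score data), not a method taken from the study:

import random

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    # Hypothetical helper, not from the study: estimates how often
    # model A outscores model B when the benchmark's questions are
    # resampled with replacement. scores_a and scores_b are 0/1
    # correctness records for the SAME questions, in the same order.
    assert len(scores_a) == len(scores_b)
    n = len(scores_a)
    rng = random.Random(seed)
    a_wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # one resampled "benchmark"
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            a_wins += 1
    return a_wins / n_resamples

# Made-up data: two models that agree on 82 of 100 questions.
# Model A scores 62%, model B scores 60%.
scores_a = [1] * 10 + [0] * 8 + [1] * 52 + [0] * 30
scores_b = [0] * 10 + [1] * 8 + [1] * 52 + [0] * 30
print(f"P(A beats B under resampling): {paired_bootstrap(scores_a, scores_b):.2f}")

On a 100-question benchmark like this one, model A's two-point lead survives resampling only about two times in three, which is the sort of uncertainty the study argues benchmark reports should disclose rather than omit.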
A big problem that the researchers found is that “Many benchmarks are not valid measurements of their intended targets.” That is to say, a benchmark may claim to measure a specific skill while operationalizing it in a way that doesn’t actually capture a model’s capability; a “reasoning” test that a model can pass by regurgitating memorized examples, for instance, says little about reasoning.