Intelligence is pervasive, yet its measurement seems subjective. At best, we approximate its measure through tests and benchmarks. Think of college entrance exams: Every year, countless students sign up, memorize test-prep tricks and sometimes walk away with perfect scores. Does a single number, say a 100%, mean those who got it share the same intelligence — or that they’ve somehow maxed out their intelligence? Of course not. Benchmarks are approximations, not exact measurements of someone’s — or something’s — true capabilities.
The generative AI community has long relied on benchmarks like MMLU (Massive Multitask Language Understanding) to evaluate model capabilities through multiple-choice questions across academic disciplines. This format enables straightforward comparisons, but it fails to capture the full scope of what a model can actually do.
Both Claude 3.5 Sonnet and GPT-4.5, for instance, achieve similar scores on this benchmark. On paper, this suggests equivalent capabilities. Yet people who work with these models know that there are substantial differences in their real-world performance.

What does it mean to measure ‘intelligence’ in AI?
On the heels of the new ARC-AGI benchmark release — a test designed to push models toward general reasoning and creative problem-solving — there’s renewed debate around what it means to measure “intelligence” in AI. While not everyone has tested the ARC-AGI benchmark yet, the industry welcomes this and other efforts to evolve testing frameworks.