OpenAI’s latest study argues that today’s top models already rival humans on real-world tasks, though it swears they won’t fully replace us.
OpenAI is trying to make the case that AI can actually be useful at work, at a moment when several recent studies suggest companies aren’t getting much out of their AI investments.
On Tuesday, the ChatGPT maker released a report introducing a new benchmark for testing AI on “economically valuable, real-world tasks” across 44 different occupations. The evaluation is called GDPval, and OpenAI says it’s meant to ground workplace AI debates in evidence rather than hype, and to track how models improve over time.
It comes on the heels of a recent MIT Media Lab study that found fewer than one in ten AI pilot projects delivered measurable revenue gains and warned that “95 percent of organizations are getting zero return” on their AI bets. And just last week, researchers from BetterUp Labs and Stanford’s Social Media Lab, writing in Harvard Business Review, blamed “workslop” for the lackluster results. They define workslop as “AI-generated work content that masquerades as good work, but lacks the substance to meaningfully advance a given task.”
OpenAI argues that GDPval fills a gap left by existing benchmarks, which typically test AI models on abstract academic problems rather than the kinds of day-to-day tasks people actually do at work.

What GDPval measures
“We call this evaluation GDPval because we started with the concept of Gross Domestic Product (GDP) as a key economic indicator and drew tasks from the key occupations in the industries that contribute most to GDP,” OpenAI wrote in a blog post announcing the report.