AI Is Learning to Do the Jobs of Doctors, Lawyers, and Consultants

Mercor, an AI data company, has released the AI Productivity Index (APEX), a benchmark that tests whether AI models can perform high-value knowledge work across law, medicine, finance, and management consulting. The benchmark marks a shift from abstract AI testing to directly measuring models’ ability to complete economically valuable tasks that professionals typically handle.
What you should know: APEX consists of 200 carefully designed tasks created by experienced professionals from top-tier firms, with input from former McKinsey executives, Harvard Business School leadership, and Harvard Law professors.
- Tasks include diagnosing patients based on multimedia evidence, providing legal advice on estate planning, and conducting financial valuations of healthcare technology companies.
- The benchmark was developed at a cost of over $500,000, contracting white-collar professionals averaging 7.25 years of experience from Goldman Sachs, JPMorgan, McKinsey, Boston Consulting Group, and other prestigious firms.
- Mercor pays these domain experts competitively, with rates averaging $81 per hour and reaching over $200 per hour for senior experts—equivalent to roughly $400,000 annually.
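The annualized figure follows from standard full-time billing assumptions; a minimal sketch, assuming 40 hours per week over 50 weeks (2,000 billable hours per year, a convention not stated in the article):

```python
# Annualizing an hourly rate under an assumed full-time schedule of
# 40 hours/week for 50 weeks, i.e. 2,000 billable hours per year.
HOURS_PER_YEAR = 40 * 50


def annualize(hourly_rate: float) -> float:
    """Convert an hourly rate to its full-time annual equivalent."""
    return hourly_rate * HOURS_PER_YEAR


print(annualize(200))  # senior-expert rate: 400000.0, the "roughly $400,000" figure
print(annualize(81))   # average rate: 162000.0
```

Under these assumptions, the $200-per-hour senior rate annualizes to the article's "roughly $400,000" figure.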
How current AI models performed: OpenAI’s latest models show dramatic improvement but still fall short of human-level performance on complex knowledge work.
- GPT-4o, released in May 2024, scored 35.9% on the benchmark.
- GPT-5, released just over a year later, achieved 64.2%—the highest score recorded.
- However, GPT-5 achieved perfect scores on only two of the 200 tasks, both involving “basic reasoning, simple calculations, and a lot of basic information searching.”
- Work that falls short of 100% accuracy “might be effectively useless,” according to the paper’s authors.
The big picture: This benchmark reflects the evolution of AI testing from abstract puzzles to real-world professional tasks, mirroring how AI capabilities have advanced.
- Earlier AI benchmarks relied on crowdworker services paying a few dollars per hour, while current tests require highly skilled professionals earning hundreds of dollars hourly.
- “AI got its Ph.D.,” says Brendan Foody, Mercor’s 22-year-old CEO. “Now it’s starting to enter the job market.”
- The shift parallels AI’s progression in other fields—games like Go were conquered by 2016, software engineering benchmarks emerged in 2023, and now white-collar professional work is being systematically tested.
Current limitations: APEX acknowledges several constraints that prevent it from fully replicating human professional work.
- The benchmark focuses on “well scoped deliverables” rather than open-ended tasks that might have multiple correct solutions.
- AI outputs are entirely text-based, not testing models’ ability to use computers as human workers do.
- Task descriptions require lengthy, detailed prompts that “would be more tedious than just doing it yourself,” according to finance task creator Matt Seck.
Why this matters: The benchmark arrives as AI models increasingly compete with human professionals across knowledge-intensive industries.
- A separate OpenAI benchmark published Thursday showed expert human evaluators preferred AI work to human work 47.6% of the time across 220 tasks.
- OpenAI’s models more than doubled their “win rate” against humans between June 2024 and September 2025.
- The development suggests AI is transitioning from academic curiosity to practical workforce competition, with potential implications for employment in high-skilled professions.
What they’re saying: Industry experts emphasize the significance of measuring AI’s economic utility rather than abstract capabilities.
- “Getting 100% would mean that you’d basically have an analyst or an associate in a box that you could go and send tasks to,” explains Osvald Nitski, one of the paper’s authors.
- “It’s hard to imagine a better hourly job from a pay perspective,” says Matt Seck, a former Bank of America investment banking analyst now contracted by Mercor.