The Toughest Math Benchmark Ever Built

Frontier Math approaches math reasoning in LLMs from a different perspective.

Next Week in The Sequence:

Edge 448: Dives into adversarial distillation, including some research in that area. It also reviews the LMQL framework for querying LLMs.
The Sequence Chat: Discusses the provocative topic of data walls in generative AI.
Edge 490: Dives into Anthropic’s crazy research about how LLMs can sabotage human evaluations.

Editorial: The Toughest Math Benchmark Ever Built

Mathematical reasoning is often considered one of the most critical abilities of foundational AI models and serves as a proxy for general problem-solving. Over the past few years, we have witnessed large language models (LLMs) push the boundaries of math benchmarks, scoring competitively on International Math Olympiad (IMO) problems and advancing discoveries in various areas of mathematics. From this perspective, it might seem as though LLMs are inching towards “super math powers,” but that is not entirely the case.

Much of AI’s impressive performance in math benchmarks relies on scenarios where the problem is perfectly articulated within a prompt. However, most foundational models struggle when they need to combine different ideas creatively or use “common sense” to structure and solve a problem. Can we develop benchmarks that measure these deeper reasoning capabilities?

Frontier Math, a new benchmark developed by Epoch AI, is designed to test the boundaries of artificial intelligence in advanced mathematics. Unlike traditional math benchmarks such as GSM-8K and MATH, where AI models now score over 90%, Frontier Math presents a significantly more challenging test. This higher difficulty stems from the originality of its problems, which are unpublished and crafted to resist shortcuts, requiring deep reasoning and creativity—skills that AI currently lacks.

From an AI standpoint, Frontier Math stands out by emphasizing the capacity for complex reasoning. The benchmark comprises hundreds of intricate math problems spanning diverse fields of modern mathematics, from computational number theory to abstract algebraic geometry. These problems cannot be solved through simple memorization or pattern recognition, as is often the case with existing benchmarks. Instead, they demand multi-step, logical thinking akin to research-level mathematics, often requiring hours or even days for human mathematicians to solve.

The problems within Frontier Math are specifically designed to test genuine mathematical understanding, making them “guess-proof.” This means that AI models cannot rely on pattern matching or brute-force approaches to arrive at the correct answer. The solutions, which often involve large numerical values or complex mathematical constructs, have less than a 1% chance of being guessed correctly without proper reasoning. This focus on “guess-proof” problems ensures that Frontier Math serves as a robust and meaningful test of an AI model’s ability to truly engage with advanced mathematical concepts.
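
To make the “guess-proof” idea concrete, here is a rough sketch that assumes answers are checked by exact match against a single reference value. The function names and the size of the answer space are hypothetical illustrations and are not taken from Epoch AI's evaluation code.

```python
from fractions import Fraction


def is_correct(submitted, reference):
    """Exact-match check: a near miss earns no credit."""
    return submitted == reference


def blind_guess_probability(answer_space_size):
    """Chance that one uniform random guess hits the single correct answer."""
    if answer_space_size < 1:
        raise ValueError("answer space must contain at least one value")
    return Fraction(1, answer_space_size)


# Hypothetical case: the answer is only known to be an integer below 10**6.
p = blind_guess_probability(10**6)
print("below the 1% threshold:", p < Fraction(1, 100))
```

The larger and less structured the space of plausible answers, the less a model gains from guessing, which is why large numerical answers suit this kind of benchmark.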

Despite being equipped with tools like Python to aid in problem-solving, leading AI models—including GPT-4o and Gemini 1.5 Pro—have managed to solve fewer than 2% of the Frontier Math problems. This stands in stark contrast to their high performance on traditional benchmarks and highlights the significant gap between current AI capabilities and true mathematical reasoning.

Frontier Math provides a critical benchmark for measuring progress in AI reasoning as these systems continue to evolve. The results underscore the long journey ahead in developing AI that can genuinely rival the complex reasoning abilities of human mathematicians.

Save your spot for SmallCon: A free virtual conference for GenAI builders!

It’s bringing together AI leaders from Meta, Mistral AI, Salesforce, Harvey AI, Upstage, Nubank, Nvidia, and more for deep-dive tech talks, interactive panel discussions, and live demos on the latest tech and trends in GenAI. You’ll learn firsthand how to build big with small models and architect the GenAI stack of the future.

ML Research

Modular Models

This paper examines the potential of modular AI models, particularly focusing on the MoErging approach, which combines independently trained expert models to solve complex tasks. The authors, working at Microsoft Research Lab – New York City and Microsoft Research Lab – Montréal, propose a taxonomy for categorizing and comparing different MoErging methods, which can facilitate collaborative AI development and address challenges related to data privacy, model accountability, and continuous learning —> Read more.

Semantic Hub Hypothesis

This paper, authored by researchers from MIT, the Allen Institute for AI, and the University of Southern California, proposes the semantic hub hypothesis, which suggests that language models represent semantically similar inputs from various modalities close together in their intermediate layers. The authors provide evidence for this by showing that interventions in the dominant language (usually English) in this shared semantic space can predictably alter model behavior when processing other data types like Chinese text or Python code —> Read more.

GitChameleon

This work from researchers at Mila and the Max Planck Institute for Intelligent Systems presents GitChameleon, a benchmark of 116 Python-based problems that evaluate the capacity of large language models to generate code that correctly accounts for version changes in APIs. Analysis of several models on GitChameleon suggests a correlation between model size and performance on these tasks, indicating a need for future work on version-aware code generation methods —> Read more.

Stronger Models are not Stronger Teachers

This paper, written by authors from the University of Washington and the Allen Institute for AI, investigates the impact of different “teacher” models used to generate responses for synthetic instruction tuning datasets. Contrary to common assumptions, larger teacher models don’t necessarily lead to better instruction-following abilities in the tuned “student” models, a phenomenon the authors call the “Larger Models’ Paradox”. They propose a new metric called Compatibility-Adjusted Reward (CAR) to better select teacher models suited to a given student model for instruction tuning —> Read more.

Counterfactual Generation in LLMs

Researchers from the ETH AI Center and the University of Copenhagen introduce a framework in this paper for generating counterfactual strings from language models by treating them as Generalized Structural-equation Models using the Gumbel-max trick. Applying their technique to evaluate existing intervention methods like knowledge editing and steering, they find that these methods often cause unintended semantic shifts, illustrating the difficulty of making precise, isolated modifications to language model behavior —> Read more.
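
The Gumbel-max trick mentioned above can be illustrated with a minimal sketch. This is a simplified toy, not the paper's implementation: the function names and logits are hypothetical, and it works on a single token choice rather than whole strings. Picking the index that maximizes logit plus Gumbel noise is equivalent to sampling from the softmax distribution, so holding the noise fixed while changing the logits gives a counterfactual choice for the same random draw.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel(0, 1) noise value."""
    # Clamp away from 0 so that log(u) is always defined.
    u = max(rng.random(), 1e-12)
    return -math.log(-math.log(u))


def gumbel_max_sample(logits, noise):
    """Return the index maximizing logit + noise (the Gumbel-max trick)."""
    scores = [logit + g for logit, g in zip(logits, noise)]
    return max(range(len(scores)), key=scores.__getitem__)


def counterfactual_pair(factual_logits, intervened_logits, seed=0):
    """Sample under both sets of logits while sharing the same noise."""
    rng = random.Random(seed)
    noise = [sample_gumbel(rng) for _ in factual_logits]
    factual = gumbel_max_sample(factual_logits, noise)
    counterfactual = gumbel_max_sample(intervened_logits, noise)
    return factual, counterfactual


# Hypothetical logits over a three-token vocabulary, before and after
# an intervention that boosts the last token.
print(counterfactual_pair([1.0, 0.5, 0.2], [1.0, 0.5, 3.0], seed=42))
```

Because the noise is shared, any difference between the two outputs is attributable to the change in logits rather than to sampling randomness.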

Watermarking Anything

This work by authors at Meta presents WAM, a new deep learning model that treats invisible image watermarking as a segmentation problem. The model excels at detecting, localizing, and extracting multiple watermarks embedded in high-resolution images while maintaining invisibility to the human eye and resisting attempts to remove or alter the watermarks —> Read more.

AI Tech Releases

Stripe for AI Agents

Stripe released an SDK for AI agents —> Read more.

Frontier Math

FrontierMath is, arguably, the toughest math benchmark ever created —> Read more.

AlphaFold 3

Google DeepMind open sourced a new version of its AlphaFold model for molecular biology —> Read more.

Real World AI

Airbnb’s Photo Tours

Airbnb discusses their use of vision transformers to enable their photo tour feature —> Read more.

AI Radar

AI legend Francois Chollet announced he will be leaving Google.
Cogna raised $15 million to build AI that can write enterprise software.
OpenAI seems to be inching closer to launching an AI agent for task automation.
AMD is laying off 4% of its global staff, approximately 1,000 employees, in an effort to gain a stronger foothold in the expanding AI chip market dominated by Nvidia.
Tessl.io, a company focused on AI-driven software development, has raised $125 million in funding to develop a new, open platform for AI Native Software.
Lume, a company that leverages AI to automate data integration, has secured $4.2 million in seed funding to address the persistent challenge of moving data seamlessly between systems.
Magic Story launched a children’s media platform that uses AI to create personalized stories with the goal of nurturing confidence and growth in children.
ServiceNow, a digital workflow company, is releasing over 150 new generative AI features to its Now Platform, which includes enhancements for Now Assist and an AI Governance offering to ensure secure and compliant AI practices.
Red Hat is acquiring Neural Magic to bolster its hybrid cloud AI portfolio and make generative AI more accessible to enterprises.
Snowflake announced a series of key updates at its BUILD conference, focused on improving its AI capabilities and security, with notable additions including enhancements to Cortex AI, the launch of Snowflake Intelligence, and new threat prevention measures.
Sema4.ai has introduced its Enterprise AI Agent Platform, designed to empower business users with the ability to create and manage AI agents, ultimately aiming to automate complex tasks and streamline workflows.
DataRobot launched a new platform for creating generative AI applications. Specifically, the platform focuses on AI agents and collaborative AI.
Perplexity is experimenting with incorporating advertising on its platform to generate revenue for publisher partners and ensure the long-term sustainability of its services while emphasizing its commitment to providing unbiased answers.
Writer, a company focused on generative AI for enterprises, has successfully raised $200 million in Series C funding, reaching a valuation of $1.9 billion, with plans to utilize the new capital to further develop its full-stack generative AI platform and its agentic AI capabilities.