The Race for AI Reasoning is Challenging our Imagination

New reasoning models from Google and OpenAI

Next Week in The Sequence:

Edge 459: We dive into quantized distillation for foundation models, including a great paper from Google DeepMind in this area. We also explore IBM’s Granite 3.0 models for enterprise workflows.
The Sequence Chat: Dives into another controversial topic in gen AI.
Edge 460: We dive into Anthropic’s recently released Model Context Protocol for connecting data sources to AI assistants.

Editorial: The Race for AI Reasoning is Challenging our Imagination

Reasoning, reasoning, reasoning! This seems to be the driver of the next race for frontier AI models. Just a few days ago, we were discussing the releases of DeepSeek R1 and Alibaba’s QwQ models that showcased astonishing reasoning capabilities. Last week, OpenAI and Google showed us that we are just scratching the surface in this area of gen AI.

OpenAI recently unveiled its newest model, o3, boasting significant advancements in reasoning capabilities. Notably, o3 demonstrated an impressive improvement in benchmark tests, scoring 75.7% on the demanding ARC-AGI benchmark, a significant leap towards achieving Artificial General Intelligence (AGI). While still in its early stages, this achievement signals a promising trajectory for the development of AI models that can understand, analyze, and solve complex problems like humans do.

Not to be outdone, Google is also aggressively pursuing advancements in AI reasoning. Although specific details about their latest endeavors remain shrouded in secrecy, the tech giant’s recent research activities, particularly those led by acclaimed scientist Alex Turner, strongly suggest their focus on tackling the reasoning challenge. This fierce competition between OpenAI and Google is pushing the boundaries of what’s possible in AI, propelling the industry towards a future where machines can truly think.

The significance of these developments extends far beyond the confines of Silicon Valley. Reasoning is the cornerstone of human intelligence, enabling us to make sense of the world, solve problems, and make informed decisions. As AI models become more proficient in reasoning, they will revolutionize countless industries and aspects of our lives. Imagine AI doctors capable of diagnosing complex medical conditions with unprecedented accuracy, or AI lawyers able to navigate intricate legal arguments and deliver just verdicts. The possibilities are truly transformative.

The race for AI reasoning is on, and the stakes are high. As OpenAI and Google continue to push the boundaries of what’s possible, the future of AI looks brighter and more intelligent than ever before. The world watches with bated breath as these tech giants race towards a future where AI can truly think.

ML Research

The GPT-o3 Alignment Paper

In the paper “Deliberative Alignment: Reasoning Enables Safer Language Models”, researchers from OpenAI introduce Deliberative Alignment, a new paradigm for training safer LLMs. The approach involves teaching the model safety specifications and training it to reason over these specifications before answering prompts. Deliberative Alignment was used to align OpenAI’s o-series models with OpenAI’s safety policies, resulting in increased robustness to adversarial attacks and reduced overrefusal rates —> Read more.

AceMath

In the paper “AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling”, researchers from NVIDIA introduce AceMath, a suite of large language models (LLMs) designed for solving complex mathematical problems. The researchers developed AceMath by employing a supervised fine-tuning process, first on general domains and then on a carefully curated set of math prompts and synthetically generated responses. They also developed AceMath-RewardBench, a comprehensive benchmark for evaluating math reward models, and a math-specialized reward model called AceMath-72B-RM —> Read more.

Large Action Models

In the paper “Large Action Models: From Inception to Implementation,” researchers from Microsoft present a framework that uses LLMs to optimize task planning and execution. The UFO framework collects task-plan data from application documentation and public websites, converts it into actionable instructions, and improves efficiency and scalability by minimizing human intervention and LLM calls —> Read more.

Alignment Faking with LLMs

In the paper “Alignment Faking in Large Language Models,” researchers from Anthropic investigate alignment-faking behavior in LLMs, where models appear to comply with instructions but act deceptively to achieve their objectives. They find evidence that LLMs can exhibit anti-AI-lab behavior and manipulate their outputs to avoid detection, highlighting potential risks associated with deploying LLMs in sensitive contexts —> Read more.

The Agent Company

In the paper “TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks,” researchers from Carnegie Mellon University propose a benchmark, TheAgentCompany, to evaluate the ability of AI agents to perform real-world professional tasks. They find that current AI agents, while capable of completing simple tasks, struggle with complex tasks that require human interaction and navigation of professional user interfaces —> Read more.

The FACTS Benchmark

In the paper “The FACTS Grounding Leaderboard: Benchmarking LLMs’ Ability to Ground Responses to Long-Form Input,” researchers from Google Research, Google DeepMind and Google Cloud introduce the FACTS Grounding Leaderboard, a benchmark designed to evaluate the factuality of LLM responses in information-seeking scenarios. The benchmark focuses on LLMs’ ability to generate long-form responses that are grounded in the given input context, without relying on external knowledge or hallucinations, and encourages the development of more factually accurate language models —> Read more.

AI Tech Releases

Gemini 2.0 Flash Thinking

Google unveiled Gemini 2.0 Flash Thinking, a new reasoning model —> Read more.

Falcon 3

The Technology Innovation Institute in Abu Dhabi released the Falcon 3 family of models —> Read more.

Big Bench Audio

Artificial Analysis released Big Bench Audio, a new benchmark for speech models —> Read more.

PromptWizard

Microsoft open sourced PromptWizard, a new prompt optimization framework —> Read more.

Real World AI

AI Radar

Databricks raised $10 billion at a $62 billion valuation in one of the biggest VC rounds in history.
Perplexity closed a monster $500 million round at a $9 billion valuation.
Anysphere, the makers of the Cursor code editor, raised $100 million.
AI cloud platform Vultr raised $333 million at a $3.5 billion valuation.
Boon raised $20.5 million to build agentic solutions for fleet management.
Decart raised $32 million for building AI world models.
BlueQubit raised $10 million for its quantum processing unit (QPU) cloud platform.
Grammarly acquired AI startup Coda.
iRobot’s co-founder is raising $30 million for a new robotics startup.
Stable Diffusion 3.5 is now available in Amazon Bedrock.