The Sequence Radar #: The Amazing AlphaGeometry2 Achieves Gold-Medalist Performance in Math Olympiads

DeepMind was able to substantially improve the model only about a year after its first release.


Created Using Midjourney

Next Week in The Sequence:

Our series about RAG continues with an exploration of Self-RAG. The engineering section dives into Txtai, a new framework for LLM workflows. In research, we will finally dig into DeepSeek-R1. And in the opinion section we will discuss another controversial topic in AI.

You can subscribe to The Sequence below:

TheSequence is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

📝 Editorial: The Amazing AlphaGeometry2 Achieves Gold-Medalist Performance in Math Olympiads

DeepMind’s journey toward mathematical AI dominance took a major leap last year when AlphaProof and AlphaGeometry nearly clinched gold at the International Math Olympiad (IMO). Now, with the latest upgrade, AlphaGeometry2 (AG2) has surpassed the level of an average IMO gold medalist in geometry, marking a milestone in AI-driven mathematical reasoning. The general consensus among IMO competitors is that geometry problems are among the toughest on each day of the Olympiad.

AlphaGeometry2 (AG2), an improved version of AlphaGeometry, was released in early 2025 and has demonstrated gold-medalist level performance in solving Olympiad geometry problems. The system builds upon its predecessor by expanding its domain-specific language to handle more complex problems, including those with object movements and linear equations involving angles, ratios, and distances. The coverage rate of the AG2 language on International Math Olympiad (IMO) geometry problems from 2000-2024 increased from 66% to 88%. Furthermore, AG2 utilizes a Gemini architecture for better language modeling and incorporates a knowledge-sharing mechanism that combines multiple search trees, improving its overall solving rate to 84% on IMO geometry problems from the past 25 years, compared to 54% previously. This enhanced performance has allowed AG2 to surpass an average IMO gold medalist. The system also achieved a silver-medal standard at IMO 2024.

The key improvements in AG2 can be attributed to several factors. The domain language was expanded to cover locus-type theorems, linear equations, and non-constructive problem statements. A stronger and faster symbolic engine was developed, featuring an optimized rule set, added handling of double points, and a faster implementation in C++. The system utilizes a novel search algorithm that employs multiple search trees with knowledge sharing. An enhanced language model, leveraging the Gemini architecture and trained on a larger and more diverse dataset, was also implemented. The original AlphaGeometry (AG1) used a domain-specific language with nine basic predicates. AG2 now includes additional predicates to improve its handling of angle, ratio, and linear equation problems, expanding its mathematical understanding. Furthermore, AG2 introduces eleven locus cases with corresponding predicate syntax to handle movements of objects. To support topological/non-degeneracy conditions, AG2 incorporates predicates for diagram checks. The expansion of the domain language allowed AG2 to cover 88% of all 2000-2024 IMO geometry problems.
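To make the idea of a predicate-based problem language concrete, here is a minimal, hypothetical sketch: it represents a geometry statement as a list of named predicates and measures how much of it a given vocabulary covers. The predicate names and coverage function are illustrative assumptions, not AG2's actual language or metric.

```python
from dataclasses import dataclass

# Hypothetical predicate vocabulary, loosely inspired by AG-style domain
# languages; the real AG2 language is richer (locus cases, linear equations
# over angles/ratios/distances, non-degeneracy checks, etc.).
SUPPORTED = {"coll", "perp", "para", "cong", "cyclic", "eqangle", "eqratio"}

@dataclass(frozen=True)
class Predicate:
    name: str      # e.g. "perp"
    args: tuple    # point names, e.g. ("A", "B", "C", "D")

def coverage(problem: list, vocabulary: set) -> float:
    """Fraction of a problem's predicates expressible in the vocabulary."""
    if not problem:
        return 1.0
    covered = sum(p.name in vocabulary for p in problem)
    return covered / len(problem)

# Toy problem: "AB is perpendicular to CD; A, B, E are collinear; plus a
# locus-style condition the toy vocabulary cannot express."
problem = [
    Predicate("perp", ("A", "B", "C", "D")),
    Predicate("coll", ("A", "B", "E")),
    Predicate("locus_circle", ("P", "O", "r")),  # outside the toy vocabulary
]

print(f"coverage: {coverage(problem, SUPPORTED):.0%}")  # 67% with the toy set
```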

AG2’s symbolic engine, named DDAR (Deductive Database Arithmetic Reasoning), has also been greatly enhanced. The symbolic engine computes the deduction closure, which is the set of all deducible facts from a core set of initial facts. AG2 incorporates the capability of handling double points, which allows the system to reason about points with different names but the same coordinates. The algorithm has been made more efficient by hard-coding the search for essential rules, which has reduced the number of queries for the AR sub-engine to at most cubic. A new algorithm, DDAR2, was designed to make the search for similar triangles and cyclic quadrilaterals faster. The core computation of DDAR was implemented in C++, achieving a 300x speed improvement. The enhanced symbolic engine is crucial for both training data generation and proof search.
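The notion of a deduction closure can be illustrated with a generic forward-chaining loop that applies rules until no new facts appear. This is a toy fixpoint sketch with a single made-up rule, not DDAR's actual rule set, arithmetic sub-engine, or data structures.

```python
# Toy forward chaining to a fixpoint: repeatedly apply rules until no new
# facts appear. DDAR's real rules cover geometric deductions plus an
# arithmetic (AR) sub-engine; here a rule is just a Python generator.
def deduction_closure(facts: set, rules) -> set:
    closure = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            for new_fact in rule(closure):
                if new_fact not in closure:
                    closure.add(new_fact)
                    changed = True
    return closure

# Example rule: parallelism is transitive (para(a,b) & para(b,c) -> para(a,c)).
def para_transitive(facts):
    para = {(x, y) for (name, x, y) in facts if name == "para"}
    for (a, b) in para:
        for (c, d) in para:
            if b == c and a != d:
                yield ("para", a, d)

facts = {("para", "l1", "l2"), ("para", "l2", "l3")}
print(deduction_closure(facts, [para_transitive]))
# adds ('para', 'l1', 'l3') to the two initial facts
```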

AG2’s training data was improved by scaling up resources, exploring more complex random diagrams, generating more complex theorems and proofs, and creating a more balanced distribution of question types and problems with and without auxiliary points. The data generating algorithm also produces problems of the “locus” type, which was not supported in AG1. The data generation algorithm was also made faster using a greedy discarding algorithm. The new search algorithm, SKEST (Shared Knowledge Ensemble of Search Trees), employs multiple search trees with different configurations running in parallel and sharing facts, and multiple language models for each search tree configuration are used to improve system robustness. The language model itself is a sparse mixture-of-expert Transformer-based model that leverages the Gemini training pipeline. AG2 also utilizes a more sophisticated neuro-symbolic interface by providing the language model with additional information about deductions made by DDAR. Through these advancements, AlphaGeometry2 represents a significant step forward in AI’s ability to tackle challenging mathematical reasoning tasks.
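The "shared knowledge" idea behind SKEST can be sketched as several search workers publishing the facts they establish to a common store that the others can read. The worker logic and fact names below are placeholders; the real system differs in search-tree configuration and is driven by language-model proposals.

```python
import threading

# Minimal sketch of shared knowledge across parallel search workers: each
# worker runs its own search loop but publishes proved facts to a shared,
# lock-protected store that other workers can consult. None of SKEST's
# actual tree configurations or LM-guided proposals are modeled here.
class SharedFacts:
    def __init__(self):
        self._facts = set()
        self._lock = threading.Lock()

    def publish(self, fact):
        with self._lock:
            self._facts.add(fact)

    def snapshot(self):
        with self._lock:
            return set(self._facts)

def search_worker(worker_id: int, store: SharedFacts, steps: int = 3):
    for step in range(steps):
        known = store.snapshot()  # reuse discoveries made by other workers
        fact = f"fact_from_worker{worker_id}_step{step}"
        if fact not in known:
            store.publish(fact)

store = SharedFacts()
threads = [threading.Thread(target=search_worker, args=(i, store)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(store.snapshot()), "shared facts")  # 12 facts: 4 workers x 3 steps
```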

AG2’s performance suggests we are on the cusp of AI surpassing human capabilities in competitive mathematics—an achievement that could pave the way for advancements in broader scientific and logical reasoning tasks. While AG2 demonstrates AI’s ability to master geometry, similar breakthroughs in physics and chemistry Olympiads remain unexplored. These fields introduce additional challenges such as experimental validation and real-world data interpretation, but AG2’s success suggests that similar neuro-symbolic approaches could be adapted for broader scientific discovery.

🔎 AI Research

SafeRAG

In the paper “SafeRAG: A Security Evaluation Benchmark for Retrieval-Augmented Generation”, researchers from several AI labs introduce SafeRAG, a benchmark designed to evaluate the security vulnerabilities of Retrieval-Augmented Generation (RAG) systems against data injection attacks. The study identifies four critical attack surfaces—noise, conflict, toxicity, and Denial-of-Service (DoS)—and demonstrates significant weaknesses in the retriever, filter, and generator components of RAG pipelines.
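As a rough illustration of why injected content is dangerous, the toy example below plants a conflicting passage in a corpus and shows a naive keyword-overlap retriever surfacing it. This is an assumption-laden sketch, not SafeRAG's benchmark code or attack definitions.

```python
# Toy illustration of one RAG attack surface (conflict injection): an
# attacker plants a passage that contradicts the trusted corpus, and a
# naive keyword-overlap retriever returns it at the top of the ranking.
def score(query: str, doc: str) -> int:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d)

corpus = [
    "The library upgrade to version 2.0 removed the legacy API.",
    "Version 2.0 of the library keeps the legacy API fully supported.",  # injected conflict
]

query = "does version 2.0 keep the legacy API"
ranked = sorted(corpus, key=lambda d: score(query, d), reverse=True)
for doc in ranked:
    print(doc)  # the injected, conflicting passage ranks first
```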

Self-MoA

In the paper “Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?”, researchers from Princeton University introduce Self-MoA, an ensemble method that aggregates outputs from a single top-performing Large Language Model (LLM), which surprisingly outperforms standard Mixture-of-Agents (MoA) that combines different LLMs. The paper also presents Self-MoA-Seq, a sequential version of Self-MoA that iteratively aggregates outputs, and their findings highlight that MoA performance is sensitive to model quality.
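The single-model ensembling idea can be sketched generically: sample several candidates from one model, then ask the same model to synthesize them into a final answer. The `generate` callable and prompts below are stand-ins, not the paper's actual Self-MoA prompting setup.

```python
from typing import Callable, List

# Generic sketch of single-model ensembling in the spirit of Self-MoA:
# draw several samples from one strong model, then ask the same model to
# aggregate them. `generate` stands in for any LLM call.
def self_moa(generate: Callable[[str], str], question: str, n_samples: int = 4) -> str:
    candidates: List[str] = [generate(question) for _ in range(n_samples)]
    aggregation_prompt = (
        "Synthesize the best single answer from these candidate responses:\n"
        + "\n".join(f"- {c}" for c in candidates)
        + f"\nQuestion: {question}"
    )
    return generate(aggregation_prompt)

# Usage with a dummy generator (replace with a real LLM client):
dummy = lambda prompt: f"[model output for: {prompt[:40]}...]"
print(self_moa(dummy, "What is the capital of France?"))
```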

Transformers and RL

In the paper “Improving Transformer World Models for Data-Efficient RL”, researchers from Google DeepMind present improvements to vision-based Model-Based Reinforcement Learning (MBRL) agents that use transformer world models for background planning. Key contributions include training policies on both real and imagined trajectories, implementing a nearest-neighbor tokenizer (NNT) for patches, and using block teacher forcing (BTF) to train the world model, ultimately achieving higher rewards than previous state-of-the-art methods on the Craftax-classic benchmark.
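The nearest-neighbor tokenizer idea can be sketched as mapping each image patch to the index of its closest codebook vector. The sketch below uses a random codebook purely for illustration; the paper's NNT builds its codebook from data and sits inside a full world-model pipeline.

```python
import numpy as np

# Generic sketch of a nearest-neighbor patch tokenizer: flatten image
# patches and assign each one the index of its closest codebook vector.
rng = np.random.default_rng(0)

def tokenize_patches(image: np.ndarray, patch: int, codebook: np.ndarray) -> np.ndarray:
    h, w, _ = image.shape
    tokens = []
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            vec = image[i:i + patch, j:j + patch].reshape(-1)
            dists = np.linalg.norm(codebook - vec, axis=1)
            tokens.append(int(np.argmin(dists)))
    return np.array(tokens)

image = rng.random((64, 64, 3))
codebook = rng.random((512, 8 * 8 * 3))            # 512 entries for 8x8 RGB patches
print(tokenize_patches(image, 8, codebook).shape)  # (64,): an 8x8 grid of token ids
```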

Chain-of-Action-Thought

In the paper “Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search”, researchers introduce the Chain-of-Action-Thought (COAT) mechanism, which enables Large Language Models (LLMs) to perform meta-actions during problem-solving, using a novel two-stage training paradigm involving format tuning and reinforcement learning with “Restart and Explore” (RAE) techniques. This approach results in Satori, a 7B LLM, which shows strong performance on both in-domain and out-of-domain tasks, leveraging a multi-agent framework for generating high-quality reasoning trajectories.

ZebraLogic

In the paper “ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning”, researchers from the University of Washington, Allen Institute for AI, and Stanford University introduce ZebraLogic, a comprehensive evaluation framework for assessing LLM reasoning performance on logic grid puzzles derived from constraint satisfaction problems (CSPs). This framework enables the generation of puzzles with controllable and quantifiable complexity to study the scaling limits of models such as Llama, o1, and DeepSeek-R1. The study reveals a significant decline in accuracy as problem complexity grows, termed the “curse of complexity,” even with larger models and increased inference-time computation.
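Logic grid puzzles of this kind are small constraint satisfaction problems; the toy example below brute-forces a three-house puzzle to show the structure ZebraLogic scales up. The houses, attributes, and clues are invented for illustration, not drawn from the benchmark.

```python
from itertools import permutations

# Minimal sketch of a logic-grid puzzle as a constraint satisfaction problem:
# assign each of three houses a color and a pet so that all clues hold.
# This toy solver simply brute-forces permutations.
houses = [1, 2, 3]
colors = ["red", "green", "blue"]
pets = ["cat", "dog", "fish"]

def solve():
    for color_perm in permutations(colors):
        for pet_perm in permutations(pets):
            assign = {h: (color_perm[i], pet_perm[i]) for i, h in enumerate(houses)}
            # Clues: the red house keeps the dog; the fish is not in house 1;
            # the green house is immediately left of the blue house.
            if all([
                any(c == "red" and p == "dog" for c, p in assign.values()),
                assign[1][1] != "fish",
                any(assign[h][0] == "green" and assign[h + 1][0] == "blue"
                    for h in houses[:-1]),
            ]):
                yield assign

for solution in solve():
    print(solution)  # every assignment consistent with the clues
```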

Edge LLMs

In a blog post titled “Advances to low-bit quantization enable LLMs on edge devices”, researchers from Microsoft Research discuss how low-bit quantization can enable the deployment of large language models (LLMs) on edge devices by compressing models and reducing memory demands. They developed three techniques: Ladder, T-MAC, and LUT Tensor Core, to address challenges in mixed-precision matrix multiplication (mpGEMM) and improve computational efficiency for LLMs on resource-constrained devices. These innovations include a data type compiler, a table lookup method, and a hardware design for low-bit LLM inference.
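To give a rough sense of what low-bit quantization does, the sketch below applies simple symmetric 4-bit quantization to a weight matrix and measures the reconstruction error. It is a generic illustration under stated assumptions, not the Ladder, T-MAC, or LUT Tensor Core techniques, which involve compilers, table lookups, and hardware design.

```python
import numpy as np

# Simple symmetric 4-bit weight quantization: map float weights to integers
# in [-8, 7] with a per-tensor scale, then dequantize for use. Real edge
# deployments use far more sophisticated grouping, lookup, and kernel tricks.
def quantize_4bit(weights: np.ndarray):
    scale = np.abs(weights).max() / 7.0            # 4-bit signed range: [-8, 7]
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
q, scale = quantize_4bit(w)
error = np.abs(w - dequantize(q, scale)).mean()
print(f"mean absolute quantization error: {error:.4f}")
```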

🤖 AI Tech Releases

Deep Research

OpenAI introduced Deep Research, its AI reasoning agent.

Gemini 2.0 GA

Google made Gemini 2.0 available to all users.

DABStep

Hugging Face open sourced DABStep, a multi-step reasoning benchmark consisting of 450 tasks.

🛠 Real World AI

Bug Catching at Meta

Meta shares some of the details behind its LLM-based solution for bug catching.

Slack AI Performance

Slack shares some of the best practices to maintain the performance of its Slack AI platform.

LLM-Powered Search at Yelp

Yelp discusses the use of LLMs for its search capabilities.

📡 AI Radar
