The model is pushing the boundaries of algorithmic discovery.
Next Week in The Sequence:
We are going deeper into DeepMind’s AlphaEvolve. The knowledge section continues our series about evals by diving into multimodal benchmarks. The opinion section will discuss practical tips for using AI for coding. The engineering section will review another cool AI framework.
You can subscribe to The Sequence below:
📝 Editorial: The Amazing AlphaEvolve
DeepMind has done it again and shipped another model that pushes the boundaries of what we consider possible with AI. AlphaEvolve is a groundbreaking AI system that redefines algorithm discovery by merging large language models with evolutionary optimization. It builds upon prior efforts like AlphaTensor, but significantly broadens the scope: instead of evolving isolated heuristics or functions, AlphaEvolve can evolve entire codebases. The system orchestrates a feedback loop in which an ensemble of LLMs proposes modifications to candidate programs, which are then evaluated against a target objective. Promising solutions are preserved and recombined in future generations, driving continual innovation. This architecture enables AlphaEvolve to autonomously invent algorithms of substantial novelty and complexity.
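In spirit, that generate-evaluate-select loop can be sketched as follows. Everything here is illustrative: candidates are lists of numbers rather than programs, the objective is a toy sum-of-squares to minimize, and `propose_mutation` merely stands in for an LLM proposing code edits; none of these names come from AlphaEvolve itself.

```python
import random

def propose_mutation(candidate, rng):
    # Stand-in for an LLM proposing a modification to a candidate program.
    child = list(candidate)
    i = rng.randrange(len(child))
    child[i] += rng.uniform(-1.0, 1.0)
    return child

def evaluate(candidate):
    # Deterministic objective; lower is better. In AlphaEvolve this role is
    # played by automated tests or benchmarks scoring the candidate program.
    return sum(x * x for x in candidate)

def evolve(seed_candidate, generations=200, population_size=8, seed=0):
    rng = random.Random(seed)
    population = [seed_candidate]
    for _ in range(generations):
        parent = min(population, key=evaluate)    # exploit the best so far
        child = propose_mutation(parent, rng)     # explore a variant
        population.append(child)
        population.sort(key=evaluate)
        population = population[:population_size] # keep only the fittest
    return population[0]

best = evolve([3.0, -2.0, 1.5])
print(evaluate(best) < evaluate([3.0, -2.0, 1.5]))  # True: the loop improves the seed
```

The key structural idea carries over: a generator proposes variants, a verifier scores them, and selection pressure keeps only promising candidates for the next round.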
One of AlphaEvolve’s most striking contributions is a landmark result in computational mathematics: the discovery of a new matrix multiplication algorithm that improves upon Strassen’s 1969 breakthrough. For the specific case of 4×4 complex-valued matrices, AlphaEvolve found an algorithm that completes the task in only 48 scalar multiplications, outperforming Strassen’s method after 56 years. This result highlights the agent’s ability to produce not only working code but mathematically provable innovations that shift the boundary of known techniques. It offers a glimpse into a future where AI becomes a collaborator in theoretical discovery, not just an optimizer.
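To put the 48-multiplication figure in context, here is a quick back-of-the-envelope count (our own arithmetic, not taken from DeepMind's paper): the schoolbook method needs n³ scalar multiplications, while applying Strassen's 7-multiplication 2×2 scheme recursively gives 7 raised to log₂(n) for power-of-two sizes.

```python
import math

def naive_mults(n):
    # Schoolbook matrix multiplication: n^3 scalar multiplications.
    return n ** 3

def strassen_mults(n):
    # Strassen replaces the 8 block multiplications of a 2x2 partition
    # with 7; applied recursively this gives 7^(log2 n) for n a power of 2.
    return 7 ** int(math.log2(n))

print(naive_mults(4))     # 64
print(strassen_mults(4))  # 49
```

So for 4×4 matrices the naive count is 64 and recursive Strassen gives 49; AlphaEvolve's 48 edges out both.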
AlphaEvolve isn’t confined to abstract theory. It has demonstrated real-world value by optimizing key systems within Google’s infrastructure. Examples include improvements to TPU circuit logic, the training pipeline of Gemini models, and scheduling policies for massive data center operations. In these domains, AlphaEvolve discovered practical enhancements that led to measurable gains in performance and resource efficiency. The agent’s impact spans the spectrum from algorithmic theory to industrial-scale engineering.
Crucially, AlphaEvolve’s contributions are not just tweaks to existing ideas: they are provably correct and often represent entirely new approaches. Each proposed solution is rigorously evaluated through deterministic testing or benchmarking pipelines, with only high-confidence programs surviving the evolutionary loop. This greatly reduces the risk of brittle or unverified output. The result is an AI system capable of delivering robust and reproducible discoveries that rival those of domain experts.
At the core of AlphaEvolve’s engine is a strategic deployment of Gemini Flash and Gemini Pro—models optimized respectively for high-throughput generation and deeper, more refined reasoning. This combination allows AlphaEvolve to maintain creative breadth without sacrificing quality. Through prompt engineering, retrieval of prior high-performing programs, and an evolving metadata-guided prompt generation process, the system effectively balances exploration and exploitation in an ever-growing solution space.
Looking ahead, DeepMind aims to expand access to AlphaEvolve through an Early Access Program targeting researchers in algorithm theory and scientific computing. Its general-purpose architecture suggests that its application could scale beyond software engineering to domains like material science, drug discovery, and automated theorem proving. If AlphaFold represented AI’s potential to accelerate empirical science, AlphaEvolve points toward AI’s role in computational invention itself. It marks a paradigm shift: not just AI that learns, but AI that discovers.
🔎 AI Research
AlphaEvolve
AlphaEvolve is an LLM-based evolutionary coding agent capable of autonomously discovering novel algorithms and improving code for scientific and engineering tasks, such as optimizing TPU circuits or discovering faster matrix multiplication methods. It combines state-of-the-art LLMs with evaluator feedback loops and has achieved provably better solutions on several open mathematical and computational problems.
Continuous Thought Machines
This paper from Sakana AI introduces the Continuous Thought Machine (CTM), a biologically inspired neural network architecture that incorporates neuron-level temporal dynamics and synchronization to model a time-evolving internal dimension of thought. CTM demonstrates adaptive compute and sequential reasoning across diverse tasks such as ImageNet classification, mazes, and RL, aiming to bridge the gap between biological and artificial intelligence.
DarkBench
DarkBench is a benchmark designed to detect manipulative design patterns in large language models—such as sycophancy, brand bias, and anthropomorphism—through 660 prompts targeting six categories of dark behaviors. It reveals that major LLMs from OpenAI, Anthropic, Meta, Google, and Mistral frequently exhibit these patterns, raising ethical concerns in human-AI interaction.
Sufficient Context
This paper proposes the notion of “sufficient context” in RAG systems and develops an autorater that labels whether context alone is enough to answer a query, revealing that many LLM failures arise not from poor context but from incorrect use of sufficient information. Their selective generation method improves accuracy by 2–10% across Gemini, GPT, and Gemma models by using sufficiency signals to guide abstention and response behaviors.
Better Interpretability
General Scales Unlock AI Evaluation with Explanatory and Predictive Power – University of Cambridge, Microsoft Research Asia, VRAIN-UPV, ETS, et al.
This work presents a new evaluation framework using 18 general cognitive scales (DeLeAn rubrics) to profile LLM capabilities and task demands, enabling both explanatory insights and predictive modeling of AI performance at the instance level. The framework reveals benchmark biases, uncovers scaling behaviors of reasoning abilities, and enables interpretable assessments of unseen tasks using a universal assessor trained on demand levels.
J1
This paper introduces J1, a reinforcement learning framework for training LLMs as evaluative judges by optimizing their chain-of-thought reasoning using verifiable reward signals. Developed by researchers at Meta’s GenAI and FAIR teams, J1 significantly outperforms state-of-the-art models like EvalPlanner and even larger-scale models like DeepSeek-R1 on several reward modeling benchmarks, particularly for non-verifiable tasks.
🤖 AI Tech Releases
Codex
OpenAI unveiled Codex, a cloud-based software engineering agent that can work on many tasks in parallel.
Windsurf Wave
AI coding startup Windsurf announced its first generation of frontier models.
Stable Audio Open Small
Stability AI released a new small audio model that can run on mobile devices.
📡 AI Radar
- Databricks acquired serverless Postgres platform Neon for $1 billion.
- Saudi Arabia’s Crown Prince unveiled a new company focused on advancing AI technologies in the region.
- Firecrawl is ready to pay up to $1 million for AI agent employees.
- Cognichip, an AI platform for chip design, emerged from stealth with $33 million in funding.
- Legal AI startup Harvey is in talks to raise $250 million.
- TensorWave raised $100 million to build an AMD cloud.
- Google Gemma models surpassed 150 million downloads.