The Sequence Radar #481: Humanity’s Last Exam

One of the most novel and challenging benchmarks for generative AI.



Next Week in The Sequence:

We continue our series about RAG with an overview of corrective RAG. Our engineering edition will discuss a cool new framework that is making quite a bit of noise. In research, we have no choice but to dive into DeepSeek-R1. In our opinion section we discuss what’s wrong with AI evaluations today.


📝 Editorial: Humanity’s Last Exam

“We need new evals!” That is a common mantra in generative AI. The top AI models are constantly outpacing, and often memorizing, leading benchmarks, triggering a race to develop more challenging evaluations that push the boundaries of foundation models. Last week, we saw a great addition to that roster.

The Humanity’s Last Exam (HLE) benchmark is a novel, multi-modal evaluation suite designed to assess the limits of large language model (LLM) capabilities on closed-ended academic questions. It addresses the issue of benchmark saturation, where state-of-the-art LLMs achieve near-perfect scores on existing evaluations like MMLU, hindering precise measurement of AI progress. HLE consists of 3,000 challenging questions across a wide range of subjects, including mathematics, humanities, and natural sciences. The questions are developed by subject-matter experts globally and are designed to be resistant to simple internet lookups or database retrievals, emphasizing original, precise, and unambiguous content. HLE aims to be the final closed-ended academic benchmark of its kind, providing a clear measure of the gap between current AI capabilities and expert human knowledge.

A key differentiator of HLE is its rigorous question development and review process. Each question undergoes multiple stages of scrutiny, including an initial check against state-of-the-art LLMs, and is rejected if LLMs can answer it correctly. Following this initial check, the questions proceed to a two-stage human review. The first review round involves multiple graduate-level reviewers who iteratively refine the questions. The second round is conducted by organizers and expert reviewers who approve questions based on quality and adherence to submission criteria. This multi-stage review process ensures that only the most challenging and high-quality questions are included in the benchmark. Additionally, all questions must have a known solution that is unambiguous and easily verifiable. This meticulous approach to question creation is a key contribution, as it helps ensure that the benchmark measures advanced reasoning and knowledge, rather than susceptibility to memorization or retrieval.
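To make that gate concrete, here is a minimal sketch of the LLM pre-filter step, assuming hypothetical ask_model and answers_match helpers in place of real API calls and answer verification; it illustrates the process described above and is not the authors' pipeline code.

```python
from typing import Iterable

def ask_model(model: str, question: str) -> str:
    # Placeholder for a real model API call.
    return ""

def answers_match(candidate: str, reference: str) -> bool:
    # Placeholder for exact-match or judge-verified comparison.
    return candidate.strip().lower() == reference.strip().lower()

def passes_prefilter(question: str, reference: str, models: Iterable[str]) -> bool:
    """Keep a candidate question only if no frontier model answers it correctly."""
    for model in models:
        if answers_match(ask_model(model, question), reference):
            return False  # an LLM already solves it; reject before human review
    return True           # send on to the two-stage human review
```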

Another key contribution of the HLE benchmark lies in its diverse question formats and subject coverage. The benchmark includes both exact-match and multiple-choice questions, as well as multi-modal questions that require comprehending both text and image references. This variety of formats ensures that models are evaluated across a broader range of skills. Furthermore, HLE spans a wide array of academic subjects, from STEM fields to law, history, and the arts. This breadth of subject matter ensures that the benchmark is a holistic measure of overall academic ability. By incorporating this wide variety of questions, HLE moves beyond subject-specific tests, aiming to provide a more complete assessment of an LLM’s knowledge and problem-solving capabilities.
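As an illustration of the formats described above, a single HLE-style item might be represented roughly as follows; the field names are assumptions for exposition, not the released dataset schema.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class HLEQuestion:
    subject: str                         # e.g. "mathematics", "law", "art history"
    prompt: str                          # the question text
    answer: str                          # unambiguous, easily verifiable solution
    choices: Optional[List[str]] = None  # set only for multiple-choice items
    image_path: Optional[str] = None     # set only for multi-modal items
```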

The evaluation results of HLE demonstrate its efficacy as a challenging benchmark. State-of-the-art LLMs consistently show low accuracy (less than 10%) and poor calibration on HLE, indicating a substantial gap between current model capabilities and expert-level performance. Models often provide incorrect answers with high confidence rather than acknowledging their uncertainty, which highlights the problem of hallucination. This level of difficulty contrasts with the saturation seen in many existing benchmarks, demonstrating the utility of HLE in assessing frontier AI capabilities. Furthermore, the evaluation setup includes a standardized system prompt that structures model responses and uses GPT-4o as a judge to verify answer correctness, ensuring consistency and objectivity.
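A hedged sketch of that evaluation loop, assuming a hypothetical call_llm helper and an illustrative system prompt rather than HLE's actual harness: each model answers under a standardized prompt, and a judge model verifies the response against the reference answer.

```python
def call_llm(model: str, system: str, user: str) -> str:
    # Placeholder for a real chat-completion API call.
    return ""

# Illustrative standardized system prompt (not HLE's exact wording).
SYSTEM_PROMPT = (
    "Answer the question. Finish with 'Final answer:' followed by your answer "
    "and a confidence estimate from 0 to 100%."
)

def grade_with_judge(model: str, judge: str, question: str, reference: str) -> bool:
    response = call_llm(model, SYSTEM_PROMPT, question)
    verdict = call_llm(
        judge,
        "You are a strict grader. Reply with 'correct' or 'incorrect' only.",
        f"Question: {question}\nReference answer: {reference}\nModel response: {response}",
    )
    return verdict.strip().lower().startswith("correct")
```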

In conclusion, HLE is a significant contribution to the field of AI benchmarking. Its challenging, original questions, rigorous review process, and broad subject coverage distinguish it from existing benchmarks, and it provides a clear measure of AI capabilities at the frontier of human knowledge. The low accuracy and poor calibration of current LLMs underscore both the gap that remains between today’s models and expert human performance and the need for evaluations that can accurately measure progress toward closing it. The public release of HLE aims to serve as a common reference point for researchers and policymakers. Although HLE is designed to be the final closed-ended academic benchmark of its kind, it does not evaluate open-ended research capabilities, and so it is not the final benchmark for AI overall.

🔎 AI Research

Chain of Agents

In the paper “Chain of Agents: Large Language Models Collaborating on Long-Context Tasks”, researchers from Penn State University and Google Cloud AI Research introduce a novel framework called Chain-of-Agents (CoA) that uses multiple collaborating agents to process long-context tasks, improving performance over strong baselines like RAG and Full-Context approaches. CoA mitigates long-context focus issues by having worker agents sequentially handle different parts of the input text, and then using a manager agent to synthesize the results.
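A minimal sketch of that worker-and-manager flow, assuming a hypothetical call_llm helper in place of a real model call; it illustrates the idea summarized above rather than the authors' implementation.

```python
def call_llm(prompt: str) -> str:
    # Placeholder for a real model API call.
    return ""

def chain_of_agents(chunks: list, question: str) -> str:
    notes = ""  # the communication unit passed along the worker chain
    for chunk in chunks:
        # Each worker agent reads one chunk and updates the running notes.
        notes = call_llm(
            f"Previous notes: {notes}\nText chunk: {chunk}\n"
            f"Update the notes with evidence relevant to: {question}"
        )
    # The manager agent synthesizes the accumulated notes into a final answer.
    return call_llm(f"Notes: {notes}\nAnswer the question: {question}")
```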

Humanity’s Last Exam

In the paper “Humanity’s Last Exam”, AI researchers developed a challenging multi-modal benchmark called HLE, consisting of 3,000 questions across a wide range of subjects and designed to assess the limits of LLM capabilities, with the goal of giving scientists and policymakers a common resource for tracking AI progress. HLE addresses the saturation of existing benchmarks, on which current LLMs already achieve high accuracy.

Qwen2.5-1M

In the paper “Qwen2.5: Advancing LLMs to 1 Million Context Length”, the Qwen team presents the Qwen2.5 model, which extends the context length of Large Language Models to one million tokens and demonstrates significant improvements on long-context tasks while maintaining performance on short-context benchmarks. The researchers evaluated the model using benchmarks such as RULER and LV-Eval to assess the model’s ability to understand and process long sequences of text.

Mechanistic Interpretability

In the paper “Mechanistic Interpretability: Open Problems and the Road Ahead”, researchers from Anthropic, King’s College London, Imperial College London, MATS, MIT, Northeastern University, Tel Aviv University, Goodfire, Timaeus, University of Melbourne, METR and Pr(AI)2r group discuss the current frontier of mechanistic interpretability, its open problems, and future research directions that are necessary to realize the benefits of the field. The review emphasizes the importance of developing methods to understand the inner workings of neural networks, including identifying task-relevant subgraphs and iteratively describing the function of individual components.

RL vs. SFT

In the paper “Generalization vs Memorization: A Comparative Study of Supervised Fine-tuning and Reinforcement Learning on LLM and VLM”, researchers from UC Berkeley, Google DeepMind, NYU and other institutions explore the generalization capabilities of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) on both Large Language Models (LLMs) and Vision Language Models (VLMs) in two tasks: GeneralPoints and V-IRL, and show that RL demonstrates superior generalization while SFT can help stabilize output formats. The study highlights that RL is better at learning generalizable rules that can be applied to unseen tasks.

Selene Mini

In the paper “Atla Selene Mini: A General Purpose Evaluation Model”, researchers from Atla, University College London and Cohere introduce Atla Selene Mini, a small language model-as-a-judge (SLMJ) used to evaluate the outputs of other language models, and find that it achieves the highest overall performance among the judge models compared. The paper also examines how other models perform when used to evaluate LLM responses across multiple benchmarks.

🤖 AI Tech Releases

Qwen2.5-Max

Alibaba unveiled Qwen2.5-Max, a large MoE model that claims impressive performance across leading benchmarks.

OpenAI o3-mini

OpenAI released its next-generation reasoning model.

Mistral Small 3

Mistral released a new 24B parameter model with impressive performance.

Tülu 3 405B

Allen AI released Tülu 3 405B, a new model based on its post-training framework.

Unsupervised Speech Dataset

MLCommons and Hugging Face released a massive speech dataset.

🛠 Real World AI

About R1

Anthropic’s CEO Dario Amodei published a very insightful post about the implications of the DeepSeek R1 release.

📡 AI Radar
