The Sequence Radar #506: Honor to Whom Honor is Due: AI Won the Nobel Prize of Computing

Two pioneers of reinforcement learning received the top award in computer science.

Next Week in The Sequence:

Our series about RAG continues with an explanation of multimodal RAG and a review of the ColPali research for enabling RAG with vision models. The research edition discusses Microsoft’s amazing Muse models, which can create entire video game sequences. The opinion section will explore a controversial idea: is RAG dying? We will also discuss a cool new tech stack in our engineering section.

📝 Editorial: Honor to Whom Honor is Due: AI Won the Nobel Prize of Computing

AI has been honored with the “Nobel Prize” of computer science. For those of us who have been in the AI field for a long time, last week brought joy as two of the most brilliant original thinkers in the space received well-deserved recognition.

The 2024 ACM A.M. Turing Award, often referred to as the “Nobel Prize of computing,” has been awarded to Andrew G. Barto and Richard S. Sutton for their groundbreaking contributions to reinforcement learning (RL). These pioneers have laid the conceptual and algorithmic foundations of RL, shaping the future of artificial intelligence and decision-making systems. Their seminal work, including the influential textbook Reinforcement Learning: An Introduction, published in 1998, has been cited over 75,000 times and remains the standard reference in the field.

Barto and Sutton’s research has been instrumental in developing modern computational approaches to RL, which tackle the challenge of learning how to act based on evaluative feedback. Their work spans multiple disciplines, including computer science, engineering, mathematics, neuroscience, psychology, and economics. Beyond academia, their contributions have significantly impacted real-world applications, with RL now playing a crucial role in numerous industries.

One of RL’s most notable early successes was demonstrated by Google DeepMind’s AlphaGo, which defeated world-class human Go players in 2016 and 2017. This achievement highlighted RL’s potential when combined with deep learning techniques, paving the way for deep reinforcement learning. Since then, RL has been applied in diverse fields such as robotics, automated trading, and game-playing algorithms.

Despite its successes, RL still faces several challenges that researchers continue to address. These include the exploration-exploitation dilemma, sample efficiency, reward design complexity, and generalization issues. Additionally, RL algorithms often require high computational resources, especially when simulating complex environments or processing high-dimensional data. The lack of explainability in RL models also raises concerns in critical applications such as healthcare and autonomous systems.
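To make those ideas concrete, here is a minimal sketch of tabular Q-learning with epsilon-greedy exploration, the textbook setting Barto and Sutton helped formalize. The toy chain environment and every hyperparameter below are illustrative assumptions, not taken from any specific system mentioned above.

```python
# Minimal sketch of tabular Q-learning with epsilon-greedy exploration.
# The tiny deterministic chain environment and all hyperparameters are
# illustrative assumptions chosen for brevity.
import random

N_STATES = 5          # states 0..4; reaching state 4 ends the episode with reward 1
ACTIONS = [-1, +1]    # move left or right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Deterministic chain dynamics: reward only at the rightmost state."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    done = next_state == N_STATES - 1
    return next_state, reward, done

for episode in range(500):
    state, done = 0, False
    while not done:
        # Exploration-exploitation trade-off: act randomly with probability epsilon.
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        # Evaluative feedback drives the temporal-difference update.
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = next_state

# Greedy policy learned for each state (should point right toward the goal).
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)})
```

The EPSILON parameter is exactly the exploration-exploitation dial mentioned above: higher values explore more at the cost of exploiting what has already been learned.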

In recent years, the intersection of RL with foundation models has opened up new avenues for research and application. Foundation models, trained on broad datasets using large-scale self-supervision, can be adapted to a wide range of downstream tasks. The integration of RL techniques with foundation models has led to innovations such as reinforcement learning from human feedback (RLHF), which plays a key role in the development of advanced language models like ChatGPT.
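As a rough mental model only, the snippet below compresses the RLHF idea into a toy REINFORCE loop: a stand-in "reward model" scores a handful of canned responses, and a KL-style penalty keeps the updated policy close to a frozen reference. The responses, rewards, and hyperparameters are all assumptions for illustration; real RLHF pipelines operate on full language models.

```python
# Toy sketch of the RLHF loop: push a softmax policy toward responses a
# (hypothetical) reward model prefers, while a KL-style penalty keeps it
# close to the frozen reference policy. Everything here is illustrative.
import math, random

responses = ["helpful answer", "terse answer", "off-topic answer"]
reward_model = {"helpful answer": 1.0, "terse answer": 0.3, "off-topic answer": -1.0}

logits = [0.0, 0.0, 0.0]          # trainable policy logits
ref_logits = list(logits)         # frozen reference policy
LR, BETA = 0.1, 0.05              # learning rate, KL-penalty weight

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

for _ in range(2000):
    probs = softmax(logits)
    ref_probs = softmax(ref_logits)
    i = random.choices(range(len(responses)), weights=probs)[0]
    # Reward-model score, regularized by a per-sample KL-style log-ratio penalty.
    r = reward_model[responses[i]] - BETA * math.log(probs[i] / ref_probs[i])
    # REINFORCE update: raise the log-probability of the sampled response by r.
    for j in range(len(logits)):
        grad = (1.0 if j == i else 0.0) - probs[j]
        logits[j] += LR * r * grad

# The policy should end up concentrated on the highest-reward response.
print({resp: round(p, 3) for resp, p in zip(responses, softmax(logits))})
```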

Looking ahead, RL continues to evolve and find new applications in the era of foundation models. Researchers are exploring ways to leverage these models to enhance RL’s efficiency in robotic manipulation and other tasks. The combination of RL with foundation models holds promise for addressing long-standing challenges such as sample efficiency and generalization. With ongoing advancements and the potential for further breakthroughs, the work of Barto and Sutton remains at the forefront of AI research, driving progress in machine learning and artificial intelligence.

🔎 AI Research

Code Arena

In the paper “CodeArena: A Collective Evaluation Platform for LLM Code Generation,” researchers from Nanyang Technological University, the National University of Singapore, the University of Hong Kong, Monash University, and ByteDance introduce CodeArena, an online framework for evaluating LLM code generation using a collective evaluation mechanism that dynamically adjusts model scores to mitigate biases from benchmark leakage. The platform also provides open access to solutions, test cases, and automation-friendly APIs to streamline code evaluation.

LLM Cognitive Primitives

In the paper “Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs,” researchers from Stanford University and SynthLabs investigate the intrinsic properties that enable effective self-improvement in language models, analyzing cognitive behaviors such as verification, backtracking, subgoal setting, and backward chaining. The study finds that models exhibiting these reasoning behaviors from the outset can achieve substantial improvements through reinforcement learning.

Better Instruction Tuning

In the paper “Large-Scale Data Selection for Instruction Tuning,” researchers from the University of Washington and the Allen Institute for AI present a systematic study of how data selection methods scale for instruction-tuning language models, selecting up to 2.5M samples from pools of up to 5.8M samples. They find that a variant of representation-based data selection (RDS+) consistently outperforms more complex methods across all settings while being more compute-efficient.
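For readers who want the flavor of representation-based selection, here is a generic sketch: embed the candidate pool and a small target set, score each candidate by its similarity to the target, and keep the top-k. This is the general recipe, not necessarily the paper’s exact RDS+ variant, and the random embeddings below stand in for whatever encoder would actually be used.

```python
# Generic sketch of representation-based data selection (not necessarily the
# paper's exact RDS+ recipe): rank candidate training examples by embedding
# similarity to a small target set and keep the top-k. The random embeddings
# are stand-ins for a real encoder's outputs.
import numpy as np

rng = np.random.default_rng(0)
pool = rng.normal(size=(10_000, 128))     # candidate-example embeddings (assumed)
target = rng.normal(size=(32, 128))       # embeddings of the target/eval set (assumed)
k = 1_000                                 # selection budget

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

pool_n, target_n = normalize(pool), normalize(target)
# Score each candidate by its mean cosine similarity to the target set.
scores = pool_n @ target_n.mean(axis=0)
selected = np.argsort(-scores)[:k]        # indices of the k highest-scoring examples
print(selected[:10], float(scores[selected[0]]))
```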

MultiAgentBench

In the paper “MultiAgentBench: Evaluating the Collaboration and Competition of LLM Agents,” researchers from the University of Illinois Urbana-Champaign introduce MultiAgentBench, a benchmark designed to evaluate LLM-based multi-agent systems across diverse, interactive scenarios, measuring task completion and the quality of collaboration and competition. The framework uses milestone-based key performance indicators and evaluates various coordination protocols and strategies.

Union of Experts

In the paper “Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer,” researchers from Dalian University of Technology introduce Union-of-Experts (UoE), which decomposes a transformer into an equitant group of experts and implements selective routing over input data and experts, enhancing model performance and computational efficiency. The UoE model incorporates innovations such as equitant expert decomposition, patch-wise data selection, expert selection strategies, and parallel implementation, and demonstrates superior performance on image and natural language tasks compared to full-attention models and other MoEs.
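As a simplified illustration of selective routing in general (not the paper’s specific UoE decomposition), the sketch below routes a single token to the top-k of a few toy linear “experts” and combines their outputs with renormalized gate weights; all shapes and parameters are made up for the example.

```python
# Generic sketch of top-k expert routing for one token (illustrative only;
# UoE's equitant decomposition and patch-wise selection are more involved).
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 4, 2
x = rng.normal(size=(d_model,))                      # one token's hidden state (assumed)
router_w = rng.normal(size=(d_model, n_experts))     # router projection (assumed)
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]  # toy linear experts

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

gate = softmax(x @ router_w)                 # routing probabilities over experts
chosen = np.argsort(-gate)[:top_k]           # only the top-k experts are evaluated
weights = gate[chosen] / gate[chosen].sum()  # renormalize the kept gate values

# Output is the gate-weighted sum of the selected experts' outputs.
y = sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))
print(y.shape, chosen)
```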

START

In the paper “START: Self-taught Reasoner with Tools”, researchers from Alibaba Group introduce START (Self-Taught Reasoner with Tools), a novel tool-integrated long CoT reasoning LLM that enhances reasoning by leveraging external tools. START uses a self-learning framework with Hint-infer and Hint Rejection Sampling Fine-Tuning (Hint-RFT) to achieve high accuracy on PhD-level science QA, competition-level math benchmarks, and the competition-level code benchmark.

🤖 AI Tech Releases

QwQ-32B

Alibaba released its QwQ-32B model, trained with large-scale reinforcement learning techniques.

Anthropic Console

Anthropic released a new tool that improves prompt management in LLM apps.

Mistral OCR

Mistral released a new API for document understanding.

Aya Vision

Cohere announced the open source release of Aya Vision, a new vision model.

Light-R1

This new model claims to surpass DeepSeek-R1 in math with only about $1,000 in training cost.

🛠 Real World AI

SageMaker at Salesforce

Salesforce discusses its use of Amazon SageMaker for inference workloads.

📡 AI Radar