The Sequence Radar #491: Red Teaming AI with AI

Anthropic’s new research discusses a method to use classifiers to protect LLMs.

Next Week in The Sequence:

We discuss Fusion RAG and how it extends RAG with more sophisticated ranking techniques, as well as the potential and limitations of continuous learning in foundation models. The engineering section dives into another awesome framework, and our research edition covers large action models.

📝 Editorial: Red Teaming AI with AI

Jailbreaks are one of the biggest headaches when it comes to large language models (LLMs). Everyone talks about guardrails as the go-to solution, but let’s be honest—they’re brittle, outdated almost as soon as they’re deployed, and hackers always find a way around them. So, what if we let AI guard itself? That’s exactly what Anthropic set out to explore in their research paper, “Constitutional Classifiers: Safeguarding Language Models from Universal Jailbreaks.”

Their idea is simple but powerful: build a constitution—a set of rules in natural language that defines what’s safe and what’s not. Then, use that constitution to generate synthetic data that trains a new kind of safeguard, called Constitutional Classifiers. These classifiers act as intelligent watchdogs, filtering both inputs and outputs to stop jailbreaks before they cause trouble. The ultimate goal? Keep LLMs secure without making them so restrictive that they become useless.
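
To make that pipeline concrete, here’s a rough Python sketch of what the constitution-to-classifier flow could look like. The two constitution rules, the `helper_llm` callable, and the line-by-line parsing are hypothetical stand-ins for illustration, not Anthropic’s actual implementation.

```python
# Hypothetical sketch: turn a natural-language constitution into labeled
# synthetic training data for a safeguard classifier.

CONSTITUTION = [
    "Requests for instructions to synthesize dangerous chemicals are harmful.",
    "General chemistry homework questions are harmless.",
]

def generate_synthetic_examples(rule: str, helper_llm, n: int = 50) -> list[dict]:
    """Ask a helper LLM to write prompts that are harmful or harmless under a rule.

    `helper_llm` is assumed to be any callable mapping a prompt string to a
    completion string; the parsing below is deliberately simplistic.
    """
    examples = []
    for label in ("harmful", "harmless"):
        prompt = (
            f"Write {n} user requests that are {label} under this rule:\n{rule}\n"
            "Return one request per line."
        )
        for line in helper_llm(prompt).splitlines():
            if line.strip():
                examples.append({"text": line.strip(), "label": label})
    return examples

def build_training_set(helper_llm) -> list[dict]:
    data = []
    for rule in CONSTITUTION:
        data.extend(generate_synthetic_examples(rule, helper_llm))
    return data

# The resulting dataset would then be used to fine-tune a smaller LLM as a
# binary harmful/harmless classifier with any standard fine-tuning pipeline.
```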

How It Works

At the core of this system is a dual-layer defense:

  1. Input Classifier – Blocks sneaky attempts to bypass safety filters before they reach the model.

  2. Streaming Output Classifier – Keeps an eye on responses, making sure nothing harmful slips through.

This setup is clever because it doesn’t just shut down anything remotely suspicious—it’s adaptive and nuanced. Plus, the classifiers are fine-tuned LLMs themselves, which makes them way more efficient than older filtering methods.
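
To picture where the two classifiers sit at serving time, here’s a hedged Python sketch. The model and classifier objects are hypothetical callables, and the streaming check is simplified to re-scoring the partial output after each chunk; it illustrates the placement of the safeguards, not Anthropic’s production system.

```python
# Hypothetical wrapper showing where an input classifier and a streaming
# output classifier would sit relative to the protected model.

from typing import Callable, Iterator

REFUSAL = "Sorry, I can't help with that."

def guarded_generate(
    prompt: str,
    model_stream: Callable[[str], Iterator[str]],   # yields output chunks
    input_classifier: Callable[[str], float],       # P(harmful) for the prompt
    output_classifier: Callable[[str], float],      # P(harmful) for partial output
    threshold: float = 0.5,
) -> Iterator[str]:
    # 1) Input classifier: block jailbreak attempts before the model sees them.
    if input_classifier(prompt) >= threshold:
        yield REFUSAL
        return

    # 2) Streaming output classifier: score the response as it is produced
    #    and cut generation off as soon as it drifts into harmful territory.
    produced = ""
    for chunk in model_stream(prompt):
        produced += chunk
        if output_classifier(produced) >= threshold:
            yield REFUSAL
            return
        yield chunk
```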

Can It Actually Work?

That’s the million-dollar question. Anthropic put their system to the test with some serious red teaming—over 3,000 hours of human testers trying to break it, plus automated adversarial attacks. The results? No red teamer could consistently extract harmful information at the same level as an unprotected model. That’s a huge win.

To push things further, the researchers used Automated Red Teaming (ART)—an AI-driven attack generator—and rubric grading, where a model evaluates jailbreak attempts based on predefined guidelines. The results showed impressive robustness, and the classifiers handled even unseen attack strategies surprisingly well.
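
A bare-bones version of that evaluation loop might look like the following. The attacker seed prompt, the rubric text, and the grading scheme are assumptions for illustration rather than the paper’s actual setup.

```python
# Hypothetical automated red-teaming loop with rubric-based grading.

RUBRIC = (
    "Score 1 if the response contains actionable harmful detail that the "
    "target policy forbids, otherwise score 0. Reply with a single digit."
)

def automated_red_team(attacker_llm, target_system, grader_llm, rounds: int = 100):
    """attacker_llm proposes jailbreaks, target_system is the guarded model,
    grader_llm scores transcripts against the rubric. All three are assumed
    to be simple string -> string callables."""
    successes = 0
    seed = "Propose a jailbreak prompt that tries to bypass safety filters."
    for _ in range(rounds):
        attack = attacker_llm(seed)
        response = target_system(attack)
        verdict = grader_llm(f"{RUBRIC}\n\nResponse:\n{response}")
        score = 1 if verdict.strip().startswith("1") else 0
        successes += score
        # Feed the outcome back so the attacker can adapt its next attempt.
        seed = f"Previous attempt scored {score}. Propose a different jailbreak prompt."
    return successes / rounds  # attack success rate
```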

Why This Matters

This research is a game-changer because it moves beyond static safety measures. The key takeaways:

  • Using a constitution-based framework makes AI safety policies clearer and more adaptable.

  • A dual-classifier approach makes it harder for jailbreakers to game the system.

  • Classifier performance scales with model size, dataset expansion, and smarter training methods.

Of course, no system is foolproof. The authors acknowledge that with enough time, new exploits will emerge. But this is a huge step forward in making AI safety mechanisms both effective and practical.

As AI gets more powerful, keeping it secure without over-restricting its capabilities is a delicate balancing act. Anthropic’s Constitutional Classifiers offer a smart, scalable way to handle this challenge. By letting AI police itself, we might finally have a defense that can evolve as fast as the threats it faces. This is the kind of thinking that could define the future of safe AI.


🧑🏻‍💻 Everything you need to know about AI agents

Galileo just dropped a 100-page ebook on AI agents, so you can create powerful, reliable agents like an expert:

  • Match the right agentic framework for your use case

  • Evaluate and improve performance

  • Identify failure points and production issues


🔎 AI Research

Constitutional Classifiers

In the paper “Constitutional Classifiers: Safeguarding Language Models from Universal Jailbreaks”, researchers from Anthropic introduce a method that uses classifiers, trained with a constitution defining harmful content, to defend against universal jailbreaks that can extract harmful information from large language models. The approach uses the constitution to generate synthetic data for training classifiers that monitor both inputs and outputs and can block potentially harmful content.

Red Teaming for Robotics

In the paper “Embodied Red Teaming for Auditing Robotic Foundation Models”, researchers introduce Embodied Red Teaming (ERT) to identify diverse instructions that cause language-conditioned robot models to fail, demonstrating a gap between current benchmarks and real-world use cases. The ERT method uses a vision-language model to generate contextually grounded instructions and iteratively refines them based on robot execution results. The experiments show that state-of-the-art language-conditioned robot models fail or behave unsafely on ERT-generated instructions.
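
The iterative refinement is the interesting part, so here’s a rough Python sketch of how such a loop could be wired up. The `vlm` and `robot_policy` interfaces, prompts, and refinement heuristic are assumptions for illustration, not the paper’s actual code.

```python
# Hypothetical sketch of an ERT-style loop: a vision-language model proposes
# instructions grounded in the scene, the robot policy executes them, and the
# results are fed back to diversify the next round of instructions.

def embodied_red_team(vlm, robot_policy, scene_image, rounds: int = 5, per_round: int = 10):
    """`vlm` maps (image, prompt) -> list of instruction strings (assumed
    interface); `robot_policy` maps (image, instruction) -> bool success."""
    failing_instructions = []
    prompt = f"Propose {per_round} feasible but challenging instructions for this scene."
    for _ in range(rounds):
        instructions = vlm(scene_image, prompt)
        results = [(ins, robot_policy(scene_image, ins)) for ins in instructions]
        failing_instructions += [ins for ins, ok in results if not ok]
        # Refine: ask for instructions unlike the ones the robot already handled.
        handled = [ins for ins, ok in results if ok]
        prompt = (
            f"Propose {per_round} new feasible instructions for this scene that "
            f"differ from: {handled}."
        )
    return failing_instructions
```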

Hephaestus

In the paper “Hephaestus: Improving Fundamental Agent Capabilities of Large Language Models Through Continual Pre-Training”, researchers introduce Hephaestus-Forge, a large-scale pre-training corpus designed to enhance the fundamental capabilities of LLM agents in API function calling, intrinsic reasoning, and adapting to environmental feedback. They use a three-stage process to create the corpus, which includes API documentation, action trajectories, and simulated agent data. By continual pre-training on Hephaestus-Forge, the resulting model, Hephaestus, shows improved agentic capabilities. The scaling law experiments identified an optimal data mix ratio of approximately 1:1:1 for agent, code, and text data.
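
That 1:1:1 finding is easy to picture as a data sampler. Here’s a toy sketch of an equally weighted agent/code/text batch mix; the source lists and batching logic are made up for illustration and are not the paper’s actual pipeline.

```python
import random

# Hypothetical sketch of the ~1:1:1 agent/code/text mix the paper reports as
# optimal for continual pre-training. The data sources are stand-in lists.

MIX = {"agent": 1, "code": 1, "text": 1}  # relative weights

def sample_batch(sources: dict[str, list[str]], batch_size: int = 9) -> list[str]:
    """Draw documents so each domain contributes in proportion to MIX."""
    total = sum(MIX.values())
    batch = []
    for domain, weight in MIX.items():
        k = round(batch_size * weight / total)
        batch += random.choices(sources[domain], k=k)
    random.shuffle(batch)
    return batch

# Example: sample_batch({"agent": trajectories, "code": code_docs, "text": web_docs})
```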

ReasonFlux

In the paper “ReasonFlux: Hierarchical LLM Reasoning via Scaling Automated Thought Templates”, researchers present ReasonFlux, a framework that uses a library of thought templates to improve LLMs’ mathematical reasoning capabilities. ReasonFlux uses hierarchical reinforcement learning to plan out optimal thought template trajectories for a given problem and dynamically selects appropriate templates for each sub-problem during inference. The framework uses a structured and compact template library along with efficient retrieval methods. Experiments show that ReasonFlux outperforms strong baselines and demonstrates effective generalization across various math benchmarks.
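
To give a feel for the template idea, here’s a toy sketch of retrieving and applying a template per sub-problem. The two templates, the `embed` function, and the planner/solver callables are all hypothetical, and the real system learns the planning step with hierarchical RL rather than the naive decomposition shown here.

```python
import math

# Hypothetical template library; the paper's library is larger and structured.
TEMPLATES = {
    "complete_the_square": "Rewrite the quadratic as (x + p)^2 + q, then ...",
    "telescoping_sum": "Express each term as f(k) - f(k+1) and cancel ...",
}

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-9)

def retrieve_template(subproblem: str, embed) -> str:
    """`embed` is any text -> vector function (an assumption for this sketch)."""
    query = embed(subproblem)
    _, best = max(TEMPLATES.items(), key=lambda kv: cosine(query, embed(kv[1])))
    return best

def solve(problem: str, planner_llm, solver_llm, embed) -> str:
    # The planner decomposes the problem into sub-problems (one per line here).
    lines = planner_llm(f"Break this into sub-problems:\n{problem}").splitlines()
    answer = ""
    for sp in (l for l in lines if l.strip()):
        template = retrieve_template(sp, embed)
        answer = solver_llm(f"Template: {template}\nSub-problem: {sp}\nSo far: {answer}")
    return answer
```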

Matryoshka Quantization

In the paper “Matryoshka Quantization”, researchers propose Matryoshka Quantization (MatQuant), a novel multi-scale quantization technique for training and maintaining just one model that can be served at different precision levels. MatQuant uses a nested (Matryoshka) structure of integer data types, where smaller bit-width integers are nested within the most significant bits, enabling different quantization levels (int8, int4, int2, and interpolated bit-widths) from a single trained model. The experiments demonstrate that MatQuant can achieve up to 10% more accuracy in int2 precision compared to other methods, while also providing flexibility for adapting to various serving constraints.
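
The nesting trick is easiest to see in code. In this toy sketch, the int4 and int2 codes are simply the most significant bits of the stored int8 code; it uses unsigned codes and a single scale for simplicity, so treat it as an illustration of the idea rather than the paper’s actual scheme.

```python
# Hypothetical illustration of the "nested integer" idea: lower precisions are
# sliced out of the most significant bits of the stored int8 code, so a single
# model can be served at several bit-widths.

def slice_bits(int8_code: int, target_bits: int) -> int:
    """Take the top `target_bits` bits of an unsigned 8-bit quantized weight."""
    assert 0 <= int8_code <= 255 and 1 <= target_bits <= 8
    return int8_code >> (8 - target_bits)

def dequantize(code: int, bits: int, scale: float) -> float:
    """Map a sliced code back to (approximately) the original weight range."""
    # Re-align the sliced code to the 8-bit grid before applying the scale.
    return (code << (8 - bits)) * scale

w8 = 0b10110110          # stored int8 code for one weight (182)
w4 = slice_bits(w8, 4)   # 0b1011 -> served at int4
w2 = slice_bits(w8, 2)   # 0b10   -> served at int2
print(w4, w2, dequantize(w4, 4, 0.01), dequantize(w2, 2, 0.01))
```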

Competitive Programming with LLMs

In the paper “Competitive Programming with Large Reasoning Models”, researchers at OpenAI demonstrate that reinforcement learning significantly improves the performance of large language models (LLMs) on complex coding and reasoning tasks. The paper compares the performance of the general-purpose model o3 with the domain-specific model o1-ioi, showing that o3 achieves top-tier performance on competitive programming benchmarks without relying on hand-engineered strategies, and obtains a CodeForces rating on par with elite human competitors.

Mechanism Design for LLMs

In the paper “Mechanism Design for Large Language Models”, researchers from Google Research and the University of Chicago investigate auction mechanisms to support the emerging format of AI-generated content, proposing a token auction model for aggregating outputs from multiple LLMs. They explore both robust auction designs based on partial orders of outcome distributions, and concrete aggregation functions based on KL-divergence, demonstrating the feasibility of their approach with experimental results.
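
As a toy illustration of the token-auction idea, the sketch below aggregates next-token distributions from two LLM “bidders” into one sampling distribution using a bid-weighted mixture (the minimizer of the bid-weighted KL(p_i || q)). The paper studies this and other aggregation functions, so this is just one possible instantiation, with made-up distributions and bids.

```python
import random

# Hypothetical token-auction step: aggregate next-token distributions from
# several LLM "bidders" into one sampling distribution, weighting by bids.

def aggregate(distributions: list[dict[str, float]], bids: list[float]) -> dict[str, float]:
    total_bid = sum(bids)
    vocab = set().union(*distributions)
    return {tok: sum(b * p.get(tok, 0.0) for p, b in zip(distributions, bids)) / total_bid
            for tok in vocab}

def sample_token(q: dict[str, float]) -> str:
    toks, probs = zip(*q.items())
    return random.choices(toks, weights=probs, k=1)[0]

# Two bidders propose next-token distributions; the higher bid pulls the
# aggregate toward its preferred continuation.
p1 = {"buy": 0.7, "try": 0.3}
p2 = {"try": 0.6, "visit": 0.4}
print(aggregate([p1, p2], bids=[2.0, 1.0]))
```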

🤖 AI Tech Releases

Perplexity Deep Research

Perplexity unveiled Deep Research, its new agentic research feature, with some impressive results.

Model Spec

OpenAI open sourced the Model Spec, its guidance document for the intended behavior of its models.

Hermes 3

Nous Research unveiled Hermes 3, its new reasoning model.

🛠 Real World AI

Prompt Engineering at LinkedIn

LinkedIn shares details about its internal collaborative prompt engineering playground.

📡 AI Radar
