By Julio Marchi
The Sequence Radar #664: The Gentle Singularity Is Already Here

Plus a major AI move by Meta AI and new releases from Apple, Mistral and OpenAI.


Next Week in The Sequence:

In our series on evals, we dive into AGI benchmarks. The research section covers Meta AI's V-JEPA 2 model. Don't miss our opinion section, where we discuss the famous superposition hypothesis in AI interpretability. Engineering explores the world of AI sandbox environments.

You can subscribe to The Sequence below:

TheSequence is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

📝 Editorial: The Gentle Singularity Is Already Here

In a recent and quietly radical blog post titled “The Gentle Singularity,” OpenAI CEO Sam Altman dropped a thesis that reads more like a plot twist than a prediction: the singularity isn’t coming—it’s already arrived. Forget the apocalyptic drama of rogue AIs and sci-fi rebellion; this version of the future is calm, smooth, and deeply weird. It’s not a bang but a gradient—one we’ve been sliding down without realizing.

Altman lays out a timeline that feels less like prophecy and more like an insider’s itinerary. Right now, AI systems are churning through cognitive labor with the kind of stamina that would make any grad student jealous. By 2026, he expects them to generate real, novel scientific discoveries; by 2027, robots should be reliably navigating the physical world. If that sounds wild, it’s worth remembering that most of us didn’t expect generative models to go from autocomplete toys to research partners in under five years either.

What makes this singularity “gentle” is its deceptive normalcy. People still walk their dogs, drink their coffee, and swipe through social media. But under the hood, AI is reshaping how work gets done. Coders are using copilots to write functions they barely touch. Scientists are fast-tracking ideas with AI-aided literature reviews and simulations. Designers are skipping mood boards in favor of generating full prototypes. It’s not flashy—but it’s everywhere.

One of the spiciest sections in Altman’s essay explores recursive acceleration: systems that build better versions of themselves, powered by increasingly autonomous infrastructure. Imagine an intelligence supply chain that bootstraps itself—data centers run by robots, trained by AIs, serving other AIs. If intelligence becomes as cheap and abundant as electricity, the result isn’t just economic growth—it’s epistemological upheaval.

Of course, it’s not all silicon utopias and self-replicating insight engines. Altman puts his finger on the two pressure points: alignment and access. In other words: Will these systems want what we want? And who gets to use them? His optimism about “good governance” is noble, but critics rightfully worry that current institutions are too slow and too fractured to manage this transition. A gentle singularity doesn’t mean a safe one.

Altman’s essay is part vision, part provocation—a call to update our mental models. No, the streets aren’t full of humanoid robots (yet), but that doesn’t mean the singularity is fiction. It just looks different than expected. As we move through this inflection point, the challenge isn’t to brace for impact—it’s to take the wheel. The future is arriving at walking speed, and it’s asking if we’re paying attention.

🔎 AI Research

📘 1. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Lab: FAIR at Meta + Mila / Polytechnique Montréal
V-JEPA 2 is a large-scale self-supervised video model trained on over 1 million hours of internet video. It delivers state-of-the-art results in motion understanding (e.g., 77.3% top-1 on SSv2), human action anticipation (39.7 recall@5 on Epic-Kitchens), and video question-answering. The model can be extended into V-JEPA 2-AC, which uses just 62 hours of unlabeled robot data to perform zero-shot pick-and-place tasks using planning—without any fine-tuning or task-specific rewards. This work demonstrates how scalable video pretraining, combined with lightweight robot interaction data, can produce world models that understand, predict, and act in the physical world.

📘 2. EXPERTLONGBENCH: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists

Lab: University of Michigan + Carnegie Mellon
EXPERTLONGBENCH introduces a new benchmark with 11 real-world, domain-specific tasks (e.g., law, healthcare, chemistry) that require long-form outputs (5K+ tokens) and rigorous accuracy. It proposes CLEAR, a structured checklist-based evaluation system co-designed with experts, which reveals that current LLMs—despite their scale—perform poorly on these tasks (best F1 = 26.8%). This work raises the bar for what it means to evaluate “expert” reasoning and opens a path toward fine-grained, reliable assessment of long outputs.

📘 3. Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation

Lab: IBM Research + Hebrew University of Jerusalem + AI2
This paper evaluates LLMs not just as generators but as judges—using debate speeches annotated by human experts. It finds that while large models can often agree with human judgments, they systematically score more harshly and deviate in style and preferences. Intriguingly, speeches generated by GPT-4.1 were rated higher than those written by expert human debaters—showing that LLMs are not only capable of judging debates, but also of winning them.

📘 4. Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction

Lab: Carnegie Mellon, UIUC, UC Berkeley, NYU, The AGI Company
This paper argues that agents should not just “think harder” (i.e., generate longer chains of thought), but instead interact more during deployment. The authors propose a new axis of test-time scaling: letting agents act longer and replan mid-task. Their curriculum-trained agent (TTI) using a 12B Gemma model outperforms previous methods on WebArena and WebVoyager by enabling more exploration and backtracking—suggesting that longer interaction, not longer thoughts, is often the real key to success.
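
The interaction-over-deliberation idea can be illustrated with a toy environment (my analogy, not the paper's benchmark): an agent searching for a hidden number succeeds not by thinking longer before its one guess, but by taking more feedback-informed steps.

```python
def interact(env_answer, step_budget):
    """Toy test-time interaction scaling: each step acts, observes the
    environment's feedback, and replans (here, via bisection)."""
    lo, hi = 0, 100
    for _ in range(step_budget):
        guess = (lo + hi) // 2          # act
        if guess == env_answer:         # observe success
            return True
        if guess < env_answer:          # replan from feedback
            lo = guess + 1
        else:
            hi = guess - 1
    return False

# With a tiny interaction budget the agent usually fails; with a modest
# budget of feedback-driven steps it succeeds on every instance.
print(sum(interact(n, 2) for n in range(101)))   # only a few answers reachable
print(sum(interact(n, 7) for n in range(101)))   # 7 steps cover all 101 answers
```

The analogue in TTI is giving a web agent more environment steps (clicks, page loads, backtracks) rather than a longer chain of thought per step.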

📘 5. Institutional Books 1.0: A 242B Token Dataset from Harvard Library’s Collections

Lab: Harvard Library
This project releases a massive 242-billion-token dataset made from nearly 1 million public-domain books scanned by Harvard Library. Beyond scale, the dataset is meticulously curated—featuring OCR cleanup, metadata enrichment, and detailed topic classification. It’s a major contribution to open-source LLM training resources, offering an unprecedented historical and multilingual corpus, particularly valuable for training models with long-context reasoning capabilities.

📘 6. The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Lab: Apple
This study reveals a sobering reality about “thinking” LLMs (like Claude, DeepSeek, GPT-o3): their performance collapses completely when faced with high-complexity reasoning puzzles, despite increased token budgets. Surprisingly, non-thinking models often outperform at low complexity, and reasoning effort decreases as problems get harder—suggesting inefficiencies and brittleness in current LRM architectures. Using controllable puzzles like Tower of Hanoi, the paper questions whether current “reasoning traces” truly reflect generalizable problem-solving or just elaborate pattern-matching.
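
Tower of Hanoi is a natural complexity dial because the minimal solution length is exactly 2**n - 1 moves, so each extra disk doubles the trace a model must produce correctly. A minimal sketch of the optimal solver (standard textbook recursion, not the paper's evaluation harness):

```python
def hanoi_moves(n, src="A", aux="B", dst="C"):
    """Optimal move sequence for n-disk Tower of Hanoi.

    Move n-1 disks out of the way, move the largest disk, then move the
    n-1 disks on top of it; length is exactly 2**n - 1.
    """
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, dst, aux)
            + [(src, dst)]
            + hanoi_moves(n - 1, aux, src, dst))

for n in (3, 7, 10):
    print(n, len(hanoi_moves(n)))  # 7, 127, 1023 moves respectively
```

This exponential growth is what lets the authors probe exactly where a reasoning model's accuracy collapses as the required trace length scales.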

🤖 AI Tech Releases

Magistral

Mistral released Magistral, its first reasoning model.

o3-Pro

OpenAI released o3-pro, a new version of its o3 model optimized for longer reasoning tasks.

Apple Releases

Apple announced a series of AI releases at WWDC25.

🛠 AI in Production

Ad Retrieval at Pinterest

Pinterest discusses the AI techniques used for ad retrieval on its platform.

📡 AI Radar

  • Sam Altman published an insightful post about the future of AI.

  • Meta AI made a massive investment in Scale AI. Alexandr Wang is joining Meta’s new superintelligence team.

  • Enterprise AI search platform Glean raised $150 million at a $7.2 billion valuation.

  • AI sales automation platform Clay raised a new round at a $3 billion valuation.

  • Outset raised a $17 million Series A to scale its AI-powered platform for automating enterprise market research interviews.

  • Multiverse Computing raised $215M for its LLM compression platform.

  • Wandercraft secured $75 million Series D funding to expand its AI-driven humanoid robotics and exoskeleton innovations.

  • Lemony launched its plug-and-play “AI-in-a-box” on-premise hardware and announced a $2 million seed round led by True Ventures.

  • Zip introduced 50 specialized AI agents to automate procurement workflows, citing early adoption by OpenAI and Canva (venturebeat.com).

  • Vanta debuted an AI agent for compliance automation, backed by its prior $150 million Series C funding (venturebeat.com).

  • Coco Robotics raised $80 million in new funding.

Posted on June 15, 2025