Alibaba QwQ Really Impresses at GPT-o1 Levels

The new model matches and surpasses GPT-o1 on reasoning tasks.


Next Week in The Sequence:

  • Edge 453: Explores cross-modal distillation for building smaller multimodal models, reviews a marquee paper from UC Berkeley in this area, and dives into Hugging Face’s Gradio framework for building web AI applications.

  • The Sequence Chat: Debates the shift from pretraining to post-training in foundation models.

  • Edge 454: Dives into Microsoft’s new agentic framework for solving complex tasks.

You can subscribe to The Sequence below:

TheSequence is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

📝 Editorial: Alibaba QwQ Really Impresses at GPT-o1 Levels

Two common debates in generative AI revolve around whether reasoning is the next frontier for foundation models and how competitive Chinese models will be with those from the West. This week, a release from Alibaba sheds light on both topics.

Since its initial release, GPT-o1 has been regarded as the most sophisticated model for long-term reasoning tasks. The model validated several key ideas in generative AI, such as the shift of compute from pretraining to inference. Since then, many models have aimed to match GPT-o1’s performance on reasoning tasks. Somewhat surprisingly, the most interesting challengers have come from China. Last week, DeepSeek showcased its R1 model, which matched GPT-o1’s performance across several reasoning benchmarks. This week, it was Alibaba’s turn.

Alibaba’s latest addition to the Qwen family, Qwen with Questions (QwQ), is making waves in the AI community as a strong open-source competitor to OpenAI’s GPT-o1 reasoning model. QwQ, currently available in a 32-billion-parameter preview version with a 32,000-token context, has already demonstrated impressive capabilities in benchmark tests. In both the AIME and MATH benchmarks, which evaluate mathematical problem-solving abilities, QwQ outperforms GPT-o1-preview. This achievement highlights the model’s strength in handling complex mathematical problems. Additionally, QwQ surpasses GPT-o1-mini on GPQA, a benchmark focused on scientific reasoning, further showcasing its proficiency in understanding and responding to scientific queries. While QwQ lags behind GPT-o1 in the LiveCodeBench coding benchmark, it still outperforms other frontier models like GPT-4o and Claude 3.5 Sonnet, solidifying its position as a strong contender in the large reasoning model (LRM) landscape.

Alibaba’s philosophy behind QwQ emphasizes the importance of “patient inquiry” and “thoughtful analysis” in achieving true understanding. QwQ embodies this approach by engaging in a step-by-step reasoning process, akin to a student meticulously reviewing their work to identify and learn from mistakes. Examples showcased on the Qwen website demonstrate QwQ’s ability to “think aloud,” meticulously evaluating different possibilities and refining its approach as it tackles complex problems. This transparency offers valuable insights into the model’s reasoning mechanisms and underscores Alibaba’s commitment to promoting a deeper understanding of how LRMs function.

The emergence of LRMs like QwQ, R1, and GPT-o1 coincides with a growing realization that simply scaling model size might not be the most effective path to achieving artificial general intelligence. The pursuit of ever-larger models faces challenges, including diminishing returns on investment and increasing difficulty in acquiring high-quality training data. Inference-time scaling, the technique utilized by both QwQ and GPT-o1, presents a promising alternative. By focusing on enhancing reasoning through extended processing time, LRMs offer a potential breakthrough in AI development, potentially unlocking new levels of cognitive ability.
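Inference-time scaling can take several forms; one of the simplest and most widely used recipes is self-consistency: sample multiple reasoning chains for the same question and keep the answer most of them agree on. A minimal sketch of the aggregation step, where the `chains` list stands in for sampled model outputs (an assumption for illustration, not from the source):

```python
from collections import Counter

def majority_vote(answers):
    """Aggregate the final answers of independently sampled
    reasoning chains and keep the most common one."""
    return Counter(answers).most_common(1)[0][0]

# Final answers extracted from five hypothetical sampled chains:
chains = ["24", "24", "18", "24", "42"]
print(majority_vote(chains))  # -> 24
```

Spending more samples at inference time trades compute for reliability: individually noisy chains converge on the right answer in aggregate.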

QwQ’s release marks a significant milestone in the evolution of AI, signaling a shift from traditional large language models (LLMs) towards LRMs that prioritize reasoning and problem-solving capabilities. Its open-source nature, impressive performance, and transparent “thinking process” are poised to accelerate advancements in the field, fostering a collaborative environment for researchers and developers to explore the full potential of LRMs. As this new class of AI models continues to mature, we can anticipate a future where AI systems not only mimic human language but also possess the capacity to reason, learn, and solve problems in ways once considered the exclusive domain of human intelligence.

And the Chinese are going to compete!


⭐️ Save your spot for SmallCon: A free virtual conference for GenAI builders! ⭐️

Join AI leaders from Meta, DoorDash, Mistral AI, Salesforce, Harvey AI, Upstage, Nubank, Nvidia, and more for deep-dive tech talks, interactive panel discussions, and live demos on the latest tech and trends in GenAI. You’ll learn firsthand how to build big with small models and architect the GenAI stack of the future.


🔎 ML Research

Marco-o1

In “Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions,” researchers from the MarcoPolo Team at Alibaba International Digital Commerce introduce a large reasoning model (LRM) called Marco-o1, focusing on open-ended questions and solutions. Marco-o1 uses techniques like Chain-of-Thought (CoT) fine-tuning, Monte Carlo Tree Search (MCTS), and innovative reasoning strategies. They showcase enhanced reasoning capabilities compared to the base model Qwen2-7B-Instruct, demonstrated through improved accuracy on the MGSM datasets and successful translation of slang expressions —> Read more.
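To make the CoT fine-tuning ingredient concrete, a training record pairs a question with the intermediate reasoning steps as well as the final answer, so the model learns to emit its reasoning before answering. The record layout below is hypothetical (not Marco-o1’s actual schema), purely to illustrate the idea:

```python
# Hypothetical CoT fine-tuning record (illustrative layout, not Alibaba's schema)
example = {
    "instruction": "A train travels 60 km in 45 minutes. What is its speed in km/h?",
    "reasoning": [
        "45 minutes is 0.75 hours.",
        "Speed = distance / time = 60 / 0.75 = 80.",
    ],
    "answer": "80 km/h",
}

def to_training_text(ex):
    """Flatten the record into a single supervised target: the
    reasoning steps come before the final answer."""
    steps = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(ex["reasoning"]))
    return f"Question: {ex['instruction']}\n{steps}\nAnswer: {ex['answer']}"

print(to_training_text(example))
```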

Star Attention

In “STAR ATTENTION: EFFICIENT LLM INFERENCE OVER LONG SEQUENCES,” researchers Shantanu Acharya and Fei Jia from NVIDIA introduce Star Attention, a two-phase, block-sparse attention mechanism for efficient LLM inference on long sequences. The method aims to improve computational efficiency by sharding attention across multiple hosts while minimizing communication overhead. They highlight that the method integrates seamlessly with most Transformer-based LLMs trained with global attention and reduces memory requirements and inference time while maintaining accuracy —> Read more.
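The two-phase structure can be sketched as an attention mask: during context encoding, each block attends only to itself plus a shared "anchor" block, and during the query phase the trailing tokens attend globally. The code below is an illustration of that sparsity pattern under stated assumptions, not NVIDIA's implementation:

```python
import numpy as np

def star_attention_mask(n_tokens, block, n_query):
    """Illustrative Star-Attention-style mask: context blocks attend
    locally plus to the first (anchor) block; the trailing n_query
    tokens attend to everything. Causality is enforced at the end."""
    mask = np.zeros((n_tokens, n_tokens), dtype=bool)
    ctx = n_tokens - n_query
    for i in range(ctx):
        start = (i // block) * block
        mask[i, start:min(start + block, ctx)] = True  # local block
        mask[i, :block] = True                          # shared anchor block
    mask[ctx:, :] = True                                # query phase: global
    causal = np.tril(np.ones((n_tokens, n_tokens), dtype=bool))
    return mask & causal

m = star_attention_mask(12, block=4, n_query=2)
```

Because context blocks never attend across each other, their key/value computation can be sharded across hosts, which is where the communication savings come from.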

Multiphase Prompting

In “Advances in run-time strategies for next-generation foundation models,” researchers from Microsoft discuss run-time strategies, focusing on their work with Medprompt and their analysis of OpenAI’s o1-preview model. They explain that while Medprompt enhances GPT-4’s performance on specialized domains through multiphase prompting, o1-preview integrates run-time reasoning directly into its design using reinforcement learning. They analyze different prompting strategies with o1-preview and emphasize the need for new research directions and more challenging medical benchmarks —> Read more.

Hybrid Graph Sequence Models

In the paper “BEST OF BOTH WORLDS: ADVANTAGES OF HYBRID GRAPH SEQUENCE MODELS,” researchers from Google Research and the New Jersey Institute of Technology introduce Graph Sequence Model (GSM), a framework for applying sequence models to graph data, and GSM++, a hybrid model that improves performance by tokenizing graphs into hierarchical sequences using the Hierarchical Affinity Clustering algorithm. GSM++ employs a hybrid architecture that combines the strengths of Transformer and recurrent models to encode these sequences for effective graph learning —> Read more.

LLM as a Judge

In the paper “From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge,” researchers from Arizona State University, University of Illinois Chicago, University of Maryland, Baltimore County, Illinois Institute of Technology, University of California, Berkeley, and Emory University introduce a comprehensive survey of the “LLM-as-a-judge” paradigm, exploring its use in various applications including evaluation, alignment, retrieval, and reasoning. The authors propose a taxonomy for LLM-as-a-judge based on input and output formats, attributes being judged, and methodologies employed, highlighting the potential and challenges of this emerging field —> Read more.

Time Series Analysis with Multimodal LLMs

In the paper “PLOTS UNLOCK TIME-SERIES UNDERSTANDING IN MULTIMODAL MODELS,” researchers from Google introduce a simple but effective method that leverages existing vision encoders of multimodal models to “see” time-series data via plots. This approach outperforms providing raw time-series data as text and reduces model API costs while offering data-driven insights for fields like healthcare, finance, and social sciences —> Read more.
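The representation change at the heart of the paper can be approximated in miniature: render the numbers as a picture instead of passing them as text. The toy rasterizer below draws an ASCII chart (the paper feeds real rendered plots to a vision encoder; this stand-in only illustrates the idea):

```python
import math

def series_to_plot(values, width=40, height=10):
    """Render a time series as a crude ASCII chart -- a stand-in for
    feeding a rendered plot image to a multimodal model's vision
    encoder instead of raw numbers as text."""
    lo, hi = min(values), max(values)
    span = hi - lo or 1.0
    grid = [[" "] * width for _ in range(height)]
    for x in range(width):
        v = values[int(x * len(values) / width)]  # nearest sample for this column
        y = int((v - lo) / span * (height - 1))   # scale value to row index
        grid[height - 1 - y][x] = "*"
    return "\n".join("".join(row) for row in grid)

series = [math.sin(t / 5) for t in range(100)]
print(series_to_plot(series))
```

The visual form makes global shape (trend, periodicity, outliers) immediately legible, which is exactly what vision encoders were trained to pick up.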

🤖 AI Tech Releases

QwQ-32B

Alibaba released QwQ-32B, a preview of its reasoning model —> Read more.

OLMo 2

Allen AI released OLMo 2, a set of 7B and 13B models trained on 5 trillion tokens —> Read more.

Model Context Protocol

Anthropic open sourced the Model Context Protocol, a new standard for integrating AI assistants with data —> Read more.

SPDL

Meta AI open sourced SPDL, a new multi-threading framework for fast data loading in AI training —> Read more.

SmolVLM

Hugging Face open sourced SmolVLM, a 2B-parameter vision language model —> Read more.

🛠 Real World AI

Semantic Layer in Salesforce’s Data Cloud

Salesforce engineers discuss the AI techniques used to power the semantic querying engine in the Data Cloud platform —> Read more.

Data Segmentation at Airbnb

Airbnb engineers discuss the data segmentation techniques used to gather insights about patterns in supply availability —> Read more.

📡 AI Radar
