Cerebras Inference and the Challenges of Challenging NVIDIA’s Dominance

Why does NVIDIA remain virtually unchallenged in the AI chip market?

Created Using Ideogram

Next Week in The Sequence:

  • Edge 427: Our series about state space models (SSMs) continues with a review of AI21’s Jamba, a model that combines transformers and SSMs. We discuss Jamba’s original research paper and the DeepEval framework.

  • Edge 428: We dive into PromptPoet, Character.ai’s framework for prompt optimization.

You can subscribe to The Sequence below:

TheSequence is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

📝 Editorial: Cerebras Inference and the Challenges of Challenging NVIDIA’s Dominance

AI hardware is experiencing an innovation renaissance, with well-funded startups emerging everywhere. Yet, NVIDIA remains virtually unchallenged, holding close to a 90% share of the AI chip market. Why is that?

We’ve all heard explanations about the advantages of NVIDIA’s software stack for acceleration compared to platforms like AMD’s, which seems like a lazy explanation for why NVIDIA is out-innovating its competitors. A simple theory that I’ve discussed with several scientists and engineers who pretrain large foundation models is that NVIDIA is the only platform receiving regular feedback about the performance of chips during pretraining runs with tens of thousands of GPUs. It turns out that at that scale, many challenges arise that are nearly impossible to simulate on a smaller scale. I will elaborate more on that theory in a future post, but the main point is that there is a very high barrier to entry when it comes to challenging NVIDIA chips for pretraining. The only viable candidate seems to be Google TPUs, which have definitely been tested at massive scale.

If pretraining is out of the equation, the obvious area to explore is inference. Here, we have a completely different playing field, where performance optimizations can be applied at a smaller scale, making it more conducive to startup disruptions.

One of the viable challengers to NVIDIA’s dominance in AI inference is Cerebras. Just last week, the well-funded startup unveiled Cerebras Inference, a solution capable of serving Llama 3.1 8B at 1,800 tokens per second and Llama 3.1 70B at 450 tokens per second. This is approximately 20x faster than NVIDIA GPUs and about 2.4x faster than Groq. The magic behind Cerebras’ performance is its wafer-scale chip design, which allows the entire model to be stored in on-chip memory, eliminating the need for inter-GPU communication.
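A rough way to see why keeping weights on-chip matters: autoregressive decoding streams essentially every weight past the compute units for each generated token, so single-stream decode speed is bounded by memory bandwidth divided by model size. The sketch below is a hedged back-of-envelope calculation, not vendor data; the ~3 TB/s HBM figure is an illustrative assumption for a GPU-class part.

```python
# Back-of-envelope arithmetic for why inference is a memory-bandwidth problem:
# each decoded token must stream every weight through the compute units, so
# single-stream decode speed ~ bandwidth / model size in bytes.
# All numbers here are illustrative assumptions, not vendor specifications.

def tokens_per_second(params_billion: float, bytes_per_param: float,
                      bandwidth_tb_s: float) -> float:
    """Rough upper bound on single-stream decode speed (bandwidth-bound)."""
    model_bytes = params_billion * 1e9 * bytes_per_param
    return (bandwidth_tb_s * 1e12) / model_bytes

# A 70B-parameter model in fp16 against ~3 TB/s of HBM (GPU-class memory)
# yields only ~21 tokens/s per stream; much higher on-chip SRAM bandwidth
# is what makes the hundreds-of-tokens-per-second regime plausible.
gpu_bound = tokens_per_second(70, 2, 3)
print(f"GPU-class single-stream bound: ~{gpu_bound:.0f} tokens/s")
```

Real systems complicate this picture (tensor parallelism aggregates bandwidth across GPUs, batching amortizes weight reads), but the bandwidth ceiling is the core reason on-chip weight storage translates directly into decode speed.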

Cerebras Inference looks impressive from top to bottom and clearly showcases the massive potential for innovation in AI inference. Competing with NVIDIA will require more than just faster chips, but Cerebras appears to be a legitimate challenger.

🔎 ML Research

The Mamba in the Llama

Researchers from Princeton University, Together AI, Cornell University, and other academic institutions published a paper proposing a technique to distill and accelerate hybrid transformer-SSM models. The method distills a transformer into a linear RNN equivalent that retains only a quarter of the attention layers —> Read more.

Diffusion Models as Real Time Game Engines

Google Research published a paper presenting GameNGen, a game engine powered by diffusion models trained on interactions with real environments over long trajectories. GameNGen can simulate the game DOOM at over 20 frames per second on a single TPU —> Read more.

LLMs that Learn from Mistakes

Researchers from Meta FAIR and Carnegie Mellon University published a paper outlining a technique to include error-correction data directly in the pretraining stage in order to improve reasoning capabilities. The resulting models outperform alternatives trained on error-free data —> Read more.

Table Augmented Generation

In a new paper, researchers from UC Berkeley proposed table augmented generation (TAG), a method that addresses some of the limitations of text-to-SQL and RAG for answering questions over relational databases. The TAG model captures a much more complete set of interactions between an LLM and a database —> Read more.
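The basic loop behind table augmented generation, query synthesis, query execution, and answer generation, can be sketched with a stubbed model. This is a hedged illustration of the pattern rather than the paper's implementation: `fake_llm` is a hard-coded stand-in for what would be real LLM calls at both the synthesis and answer-generation steps.

```python
# Sketch of a table-augmented generation loop: (1) an LLM synthesizes a query
# over the table, (2) the database executes it, (3) the LLM composes the final
# answer from the returned rows. `fake_llm` is a canned stand-in for a model.
import sqlite3

def fake_llm(prompt: str) -> str:
    # Stand-in for an LLM call; returns fixed outputs for this demo only.
    if "SQL" in prompt:
        return ("SELECT title FROM movies WHERE year >= 2000 "
                "ORDER BY rating DESC LIMIT 1")
    return "The top-rated movie since 2000 is " + prompt.split("Rows: ")[1]

def tag_answer(question: str, conn: sqlite3.Connection) -> str:
    query = fake_llm(f"Write SQL for: {question}")            # query synthesis
    rows = conn.execute(query).fetchall()                      # query execution
    return fake_llm(f"Answer: {question} Rows: {rows[0][0]}")  # answer generation

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, year INT, rating REAL)")
conn.executemany("INSERT INTO movies VALUES (?, ?, ?)",
                 [("Heat", 1995, 8.3), ("Parasite", 2019, 8.5),
                  ("Dune", 2021, 8.0)])
print(tag_answer("What is the top-rated movie since 2000?", conn))
```

The point of TAG over plain text-to-SQL is that the model participates in more than the query step; even this toy loop shows the LLM touching both ends of the database interaction.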

DisTrO

Nous Research published a paper introducing DisTrO, an architecture that reduces inter-GPU communication requirements by up to 5 orders of magnitude. DisTrO is an important step toward training large neural networks over low-bandwidth connections —> Read more.

Brain Inspired Design

Microsoft Research published a summary of its recent research in three projects that simulate how the brain learns. One project simulates how the brain computes information, another enhances accuracy and efficiency, and the third shows improved proficiency in language processing and pattern recognition —> Read more.

🤖 AI Tech Releases

Qwen2-VL

Alibaba Research released Qwen2-VL, a new version of its marquee vision-language model —> Read more.

Cerebras Inference

Cerebras released an impressive inference solution that can generate up to 1,800 tokens per second on Llama 3.1 models —> Read more.

NVIDIA NIM Blueprints

NVIDIA released NIM Blueprints, a series of templates to help enterprises get started with generative AI applications —> Read more.

Gemini Models

Google DeepMind released a new series of experimental models —> Read more.

Command R

Cohere released a new version of Command R with improvements in coding, math, reasoning and latency —> Read more.

🛠 Real World AI

Recommendations at Netflix

Netflix discusses some of the AI techniques it uses to enhance long-term satisfaction with its content recommendations —> Read more.

📡 AI Radar
