Cerebras Inference and the Challenges of Challenging NVIDIA’s Dominance

Why does NVIDIA remain virtually unchallenged in the AI chip market?

Created Using Ideogram

Next Week in The Sequence:

  • Edge 427: Our series about state space models (SSMs) continues with a review of AI21’s Jamba, a model that combines transformers and SSMs. We discuss Jamba’s original research paper and the DeepEval framework.

  • Edge 428: We dive into PromptPoet, Character.ai’s framework for prompt optimization.

You can subscribe to The Sequence below:

TheSequence is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

📝 Editorial: Cerebras Inference and the Challenges of Challenging NVIDIA’s Dominance

AI hardware is experiencing an innovation renaissance, with well-funded startups emerging everywhere. Yet, NVIDIA remains virtually unchallenged, holding close to a 90% share of the AI chip market. Why is that?

We’ve all heard explanations about the advantages of NVIDIA’s software stack for acceleration compared to platforms like AMD’s, which seems like a lazy explanation for why NVIDIA is out-innovating its competitors. A simple theory that I’ve discussed with several scientists and engineers who pretrain large foundation models is that NVIDIA is the only platform receiving regular feedback about the performance of chips during pretraining runs with tens of thousands of GPUs. It turns out that at that scale, many challenges arise that are nearly impossible to simulate on a smaller scale. I will elaborate more on that theory in a future post, but the main point is that there is a very high barrier to entry when it comes to challenging NVIDIA chips for pretraining. The only viable candidate seems to be Google TPUs, which have definitely been tested at massive scale.

If pretraining is out of the equation, the obvious area to explore is inference. Here, we have a completely different playing field, where performance optimizations can be applied at a smaller scale, making it more conducive to startup disruptions.

One of the viable challengers to NVIDIA’s dominance in AI inference is Cerebras. Just last week, the well-funded startup unveiled Cerebras Inference, a solution capable of serving Llama 3.1 8B at 1,800 tokens per second and Llama 3.1 70B at 450 tokens per second. This is approximately 20x faster than NVIDIA GPUs and about 2.4x faster than Groq. The magic behind Cerebras’ performance is its wafer-scale chip design, which allows the entire model to be stored in on-chip memory, eliminating the need for inter-GPU communication.
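A rough way to see why keeping weights on-chip matters: autoregressive decoding streams essentially every weight past the compute units for each generated token, so single-stream decode speed is bounded by memory bandwidth divided by model size. The sketch below is a hedged back-of-envelope calculation, not vendor data; the ~3 TB/s HBM figure is an illustrative assumption for a GPU-class part.

```python
# Back-of-envelope arithmetic for why inference is a memory-bandwidth problem:
# each decoded token must stream every weight through the compute units, so
# single-stream decode speed ~ bandwidth / model size in bytes.
# All numbers here are illustrative assumptions, not vendor specifications.

def tokens_per_second(params_billion: float, bytes_per_param: float,
                      bandwidth_tb_s: float) -> float:
    """Rough upper bound on single-stream decode speed (bandwidth-bound)."""
    model_bytes = params_billion * 1e9 * bytes_per_param
    return (bandwidth_tb_s * 1e12) / model_bytes

# A 70B-parameter model in fp16 against ~3 TB/s of HBM (GPU-class memory)
# yields only ~21 tokens/s per stream; much higher on-chip SRAM bandwidth
# is what makes the hundreds-of-tokens-per-second regime plausible.
gpu_bound = tokens_per_second(70, 2, 3)
print(f"GPU-class single-stream bound: ~{gpu_bound:.0f} tokens/s")
```

Real systems complicate this picture (tensor parallelism aggregates bandwidth across GPUs, batching amortizes weight reads), but the bandwidth ceiling is the core reason on-chip weight storage translates directly into decode speed.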

Cerebras Inference looks impressive from top to bottom and clearly showcases the massive potential for innovation in AI inference. Competing with NVIDIA will require more than just faster chips, but Cerebras appears to be a legitimate challenger.

🔎 ML Research

The Mamba in the Llama

Researchers from Princeton University, Together AI, Cornell University, and other academic institutions published a paper proposing a technique to distill and accelerate hybrid transformer-SSM models. The method distills a transformer into a linear RNN equivalent that retains only a quarter of the attention layers —> Read more.

Diffusion Models as Real Time Game Engines

Google Research published a paper presenting GameNGen, a game engine powered by diffusion models trained on interactions with real environments over long trajectories. GameNGen can simulate the game DOOM at over 20 frames per second on a single TPU —> Read more.

LLMs that Learn from Mistakes

Researchers from Meta FAIR and Carnegie Mellon University published a paper outlining a technique to include error-correction data directly in the pretraining stage in order to improve reasoning capabilities. The resulting models outperform alternatives trained on error-free data —> Read more.

Table Augmented Generation

In a new paper, researchers from UC Berkeley proposed table augmented generation (TAG), a method that addresses some of the limitations of text-to-SQL and RAG for answering questions over relational databases. The TAG model captures a much more complete set of interactions between an LLM and a database —> Read more.
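The basic loop behind table augmented generation, query synthesis, query execution, and answer generation, can be sketched with a stubbed model. This is a hedged illustration of the pattern rather than the paper's implementation: `fake_llm` is a hard-coded stand-in for what would be real LLM calls at both the synthesis and answer-generation steps.

```python
# Sketch of a table-augmented generation loop: (1) an LLM synthesizes a query
# over the table, (2) the database executes it, (3) the LLM composes the final
# answer from the returned rows. `fake_llm` is a canned stand-in for a model.
import sqlite3

def fake_llm(prompt: str) -> str:
    # Stand-in for an LLM call; returns fixed outputs for this demo only.
    if "SQL" in prompt:
        return ("SELECT title FROM movies WHERE year >= 2000 "
                "ORDER BY rating DESC LIMIT 1")
    return "The top-rated movie since 2000 is " + prompt.split("Rows: ")[1]

def tag_answer(question: str, conn: sqlite3.Connection) -> str:
    query = fake_llm(f"Write SQL for: {question}")            # query synthesis
    rows = conn.execute(query).fetchall()                      # query execution
    return fake_llm(f"Answer: {question} Rows: {rows[0][0]}")  # answer generation

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, year INT, rating REAL)")
conn.executemany("INSERT INTO movies VALUES (?, ?, ?)",
                 [("Heat", 1995, 8.3), ("Parasite", 2019, 8.5),
                  ("Dune", 2021, 8.0)])
print(tag_answer("What is the top-rated movie since 2000?", conn))
```

The point of TAG over plain text-to-SQL is that the model participates in more than the query step; even this toy loop shows the LLM touching both ends of the database interaction.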

DisTrO

Nous Research published a paper introducing DisTrO, an architecture that reduces inter-GPU communication requirements by up to 5 orders of magnitude. DisTrO is an important step toward training large neural networks over low-bandwidth connections —> Read more.

Brain Inspired Design

Microsoft Research published a summary of its recent research in three projects that simulate how the brain learns. One project simulates how the brain computes information, another enhances accuracy and efficiency, and the third shows improved proficiency in language processing and pattern recognition —> Read more.

🤖 AI Tech Releases

Qwen2-VL

Alibaba Research released Qwen2-VL, a new version of its marquee vision-language model —> Read more.

Cerebras Inference

Cerebras released an impressive inference solution that can generate up to 1,800 tokens per second on Llama 3.1 models —> Read more.

NVIDIA NIM Blueprints

NVIDIA released NIM Blueprints, a series of templates to help enterprises get started with generative AI applications —> Read more.

Gemini Models

Google DeepMind released a new series of experimental models —> Read more.

Command R

Cohere released a new version of Command R with improvements in coding, math, reasoning and latency —> Read more.

🛠 Real World AI

Recommendations at Netflix

Netflix discusses some of the AI techniques it uses to enhance long-term satisfaction with its content recommendations —> Read more.

📡 AI Radar
