Mistral AI’s Latest Mixture of Experts (MoE) 8x7B Model

Mistral AI, a Paris-based open-source model startup, has challenged industry norms by releasing its latest large language model (LLM), MoE 8x7B, through a simple torrent link. This stands in contrast to Google's traditional approach with its Gemini release, sparking conversations and excitement within the AI community.

Mistral AI's approach to releases has always been unconventional. Often forgoing the usual accompaniment of papers, blog posts, or press releases, the company has been uniquely effective in capturing the AI community's attention.

Recently, the company achieved a remarkable $2 billion valuation following a funding round led by Andreessen Horowitz. This came on the heels of its record-setting $118 million seed round, the largest in European history. Beyond its funding successes, Mistral AI has also been actively involved in discussions around the EU AI Act, advocating for reduced regulation of open-source AI.

Why MoE 8x7B is Drawing Attention

Described as a “scaled-down GPT-4,” Mixtral 8x7B utilizes a Mixture of Experts (MoE) framework with eight experts. Each expert has 111B parameters, coupled with 55B shared attention parameters, for a total of 166B parameters per model. This design choice is significant because only two experts are involved in the inference of each token, highlighting a shift toward more efficient and focused AI processing.

One of the key highlights of Mixtral is its ability to manage an extensive context of 32,000 tokens, providing ample scope for handling complex tasks. The model’s multilingual capabilities include robust support for English, French, Italian, German, and Spanish, catering to a global developer community.
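
For readers who want to try the model, the snippet below is a minimal sketch of loading and querying it with a recent version of the Hugging Face transformers library. It assumes the publicly released mistralai/Mixtral-8x7B-Instruct-v0.1 checkpoint and enough GPU memory (or an offloading/quantization setup) to hold the weights; the French prompt simply illustrates the multilingual support mentioned above.

```python
# Minimal sketch: load Mixtral 8x7B Instruct and generate a reply.
# Assumes the mistralai/Mixtral-8x7B-Instruct-v0.1 checkpoint and
# sufficient GPU memory (or offloading/quantization) for the weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to reduce memory use
    device_map="auto",          # spread layers across available GPUs
)

# The instruct checkpoint expects a chat-formatted prompt.
messages = [{"role": "user", "content": "Explique le Mixture of Experts en une phrase."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```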

The pre-training of Mixtral involves data sourced from the open Web, with a simultaneous training approach for both experts and routers. This method ensures that the model is not just vast in its parameter space but also finely tuned to the nuances of the vast data it has been exposed to.

Mixtral 8x7B outperforms LLaMA 2 70B and rivals GPT-3.5. It is especially notable on the MBPP task, where it reaches a 60.7% success rate, significantly higher than its counterparts. Even on the rigorous MT-Bench, tailored for instruction-following models, Mixtral 8x7B achieves an impressive score, nearly matching GPT-3.5.

Understanding the Mixture of Experts (MoE) Framework

The Mixture of Experts (MoE) model, while gaining recent attention due to its incorporation into state-of-the-art language models like Mistral AI’s MoE 8x7B, is actually rooted in foundational concepts that date back several years. Let’s revisit the origins of this idea through seminal research papers.

The Concept of MoE

Mixture of Experts (MoE) represents a paradigm shift in neural network architecture. Unlike traditional models that use a singular, homogeneous network to process all types of data, MoE adopts a more specialized and modular approach. It consists of multiple ‘expert’ networks, each designed to handle specific types of data or tasks, overseen by a ‘gating network’ that dynamically directs input data to the most appropriate expert.

A Mixture of Experts (MoE) layer embedded within a recurrent language model (Source)

The above image presents a high-level view of an MoE layer embedded within a language model. At its essence, the MoE layer comprises multiple feed-forward sub-networks, termed ‘experts,’ each with the potential to specialize in processing different aspects of the data. A gating network, highlighted in the diagram, determines which combination of these experts is engaged for a given input. This conditional activation allows the network to significantly increase its capacity without a corresponding surge in computational demand.

Functionality of the MoE Layer

In practice, the gating network evaluates the input (denoted as G(x) in the diagram) and selects a sparse set of experts to process it. This selection is modulated by the gating network’s outputs, effectively determining the ‘vote’ or contribution of each expert to the final output. For example, as shown in the diagram, only two experts may be chosen for computing the output for each specific input token, making the process efficient by concentrating computational resources where they are most needed.
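
To make these mechanics concrete, here is a simplified, self-contained PyTorch sketch of a sparsely gated MoE layer with top-2 routing. It illustrates the general technique described above, not Mistral AI's actual implementation; the layer sizes, expert definition, and class name are arbitrary choices for the example.

```python
# Simplified sparsely gated MoE layer with top-2 routing (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Each "expert" is an ordinary feed-forward sub-network.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(num_experts)
        )
        # The gating network G(x): one score per expert for every token.
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); flatten batch and sequence dims before calling.
        scores = self.gate(x)                                # (tokens, num_experts)
        top_vals, top_idx = scores.topk(self.top_k, dim=-1)  # keep only the top-k experts
        weights = F.softmax(top_vals, dim=-1)                # renormalise their scores

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e                 # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Example: route 10 tokens of dimension 64 through the layer.
tokens = torch.randn(10, 64)
layer = MoELayer(d_model=64, d_hidden=256)
print(layer(tokens).shape)  # torch.Size([10, 64])
```

Note that each token only ever runs through two of the eight experts, which is exactly why capacity can grow with the number of experts while per-token compute stays roughly constant.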

Transformer Encoder with MoE Layers (Source)

The second illustration above contrasts a traditional Transformer encoder with one augmented by an MoE layer. The Transformer architecture, widely known for its efficacy in language-related tasks, traditionally consists of self-attention and feed-forward layers stacked in sequence. Introducing MoE layers replaces some of these feed-forward layers, enabling the model to scale its capacity more effectively.
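
As a rough sketch of that substitution, the block below wires a standard self-attention sub-layer to a pluggable feed-forward module, so a dense FFN can be swapped for an MoE layer such as the one sketched earlier. The class name, layer sizes, and normalisation placement are illustrative assumptions, not taken from any specific model.

```python
# Illustrative Transformer encoder block whose feed-forward sub-layer is pluggable,
# so a dense FFN can be replaced by an MoE layer without touching the rest of the block.
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, ff_module: nn.Module):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = ff_module                      # dense FFN or an MoE layer
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention sub-layer with a residual connection.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Feed-forward (or MoE) sub-layer applied token-wise.
        b, s, d = x.shape
        ff_out = self.ff(x.reshape(b * s, d)).reshape(b, s, d)
        return self.norm2(x + ff_out)

# A dense baseline FFN; swapping in an MoE layer leaves the block unchanged.
dense_ffn = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
block = EncoderBlock(d_model=64, n_heads=4, ff_module=dense_ffn)
print(block(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```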

In the augmented model, the MoE layers are sharded across multiple devices, showcasing a model-parallel approach. This is critical when scaling to very large models, as it allows for the distribution of the computational load and memory requirements across a cluster of devices, such as GPUs or TPUs. This sharding is essential for training and deploying models with billions of parameters efficiently, as evidenced by the training of models with hundreds of billions to over a trillion parameters on large-scale compute clusters.
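
A full expert-parallel implementation relies on a distributed framework with all-to-all communication between devices, which is beyond a short example. The toy sketch below only conveys the basic placement idea under assumed names and sizes: each expert lives on its own device and tokens are copied to that device for processing, falling back to CPU when fewer GPUs are available.

```python
# Toy sketch of placing each expert on its own device (expert/model parallelism).
# A real system would use collective all-to-all communication rather than
# per-expert tensor copies; this only illustrates the placement idea.
import torch
import torch.nn as nn

num_experts = 4
devices = [
    torch.device(f"cuda:{i}") if i < torch.cuda.device_count() else torch.device("cpu")
    for i in range(num_experts)
]

# One feed-forward expert per device.
experts = [
    nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64)).to(dev)
    for dev in devices
]

def run_expert(expert_id: int, tokens: torch.Tensor) -> torch.Tensor:
    """Send tokens to the expert's device, run it, and bring the results back."""
    dev = devices[expert_id]
    return experts[expert_id](tokens.to(dev)).to(tokens.device)

tokens = torch.randn(8, 64)
print(run_expert(0, tokens).shape)  # torch.Size([8, 64])
```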

The Sparse MoE Approach with Instruction Tuning on LLM

A recent paper on sparse Mixture-of-Experts models for scalable language modeling discusses an innovative approach to improving Large Language Models (LLMs) by integrating the MoE architecture with instruction tuning techniques.

It highlights a common challenge: MoE models underperform dense models of equal computational capacity when fine-tuned for specific tasks, owing to the discrepancy between general pre-training and task-specific fine-tuning.

Instruction tuning is a training methodology where models are refined to better follow natural language instructions, effectively enhancing their task performance. The paper suggests that MoE models exhibit a notable improvement when combined with instruction tuning, more so than their dense counterparts. This technique aligns the model’s pre-trained representations to follow instructions more effectively, leading to significant performance boosts.
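
In practice, instruction tuning is ordinary supervised fine-tuning on (instruction, response) pairs, with the loss computed only on the response tokens. The sketch below shows one common way to set up a single such training step; the tiny "gpt2" checkpoint, the prompt format, and the example pair are placeholders for illustration, not the recipe used in the paper.

```python
# Minimal illustration of one instruction-tuning step: supervised fine-tuning on an
# (instruction, response) pair, with the loss masked to the response tokens only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

instruction = "Summarize: Mixture of Experts routes each token to a few expert networks."
response = " MoE sends each token to a small subset of specialized experts."

prompt_ids = tokenizer(instruction, return_tensors="pt").input_ids
response_ids = tokenizer(response, return_tensors="pt").input_ids

input_ids = torch.cat([prompt_ids, response_ids], dim=1)
labels = input_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100   # ignore the instruction tokens in the loss

loss = model(input_ids=input_ids, labels=labels).loss
loss.backward()                            # an optimizer step would follow here
print(float(loss))
```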

The researchers conducted studies across three experimental setups, revealing that MoE models initially underperform in direct task-specific fine-tuning. However, when instruction tuning is applied, MoE models excel, particularly when further supplemented with task-specific fine-tuning. This suggests that instruction tuning is a vital step for MoE models to outperform dense models on downstream tasks.

The effect of instruction tuning on MoE

The paper also introduces FLAN-MoE 32B, a model that demonstrates the successful application of these concepts. Notably, it outperforms FLAN-PaLM 62B, a dense model, on benchmark tasks while using only one-third of the computational resources. This showcases the potential of sparse MoE models combined with instruction tuning to set new standards for LLM efficiency and performance.

Implementing Mixture of Experts in Real-World Scenarios

The versatility of MoE models makes them ideal for a range of applications:

  • Natural Language Processing (NLP): MoE models can handle the nuances and complexities of human language more effectively, making them ideal for advanced NLP tasks.
  • Image and Video Processing: In tasks requiring high-resolution processing, MoE can manage different aspects of images or video frames, enhancing both quality and processing speed.
  • Customizable AI Solutions: Businesses and researchers can tailor MoE models to specific tasks, leading to more targeted and effective AI solutions.

Challenges and Considerations

While MoE models offer numerous benefits, they also present unique challenges:

  • Complexity in Training and Tuning: The distributed nature of MoE models can complicate the training process, requiring careful balancing and tuning of the experts and gating network; one common mitigation, an auxiliary load-balancing loss, is sketched after this list.
  • Resource Management: Efficiently managing computational resources across multiple experts is crucial for maximizing the benefits of MoE models.
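
One widely used way to keep the experts balanced, in the style of the sparsely gated MoE and Switch Transformer literature rather than anything specific to Mixtral, is an auxiliary loss that is minimised when both the router probabilities and the actual token dispatch are spread evenly across experts. The sketch below uses assumed names and shapes.

```python
# Illustrative auxiliary load-balancing loss: small when tokens and router
# probability mass are spread evenly across experts. Names/shapes are assumptions.
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, expert_indices: torch.Tensor,
                        num_experts: int) -> torch.Tensor:
    """router_logits: (tokens, num_experts); expert_indices: (tokens,) chosen expert per token."""
    probs = F.softmax(router_logits, dim=-1)
    # Fraction of tokens actually dispatched to each expert.
    dispatch = F.one_hot(expert_indices, num_experts).float().mean(dim=0)
    # Average router probability assigned to each expert.
    importance = probs.mean(dim=0)
    return num_experts * torch.sum(dispatch * importance)

logits = torch.randn(32, 8)                  # 32 tokens, 8 experts
aux = load_balancing_loss(logits, logits.argmax(dim=-1), num_experts=8)
print(float(aux))                            # added to the main loss with a small weight
```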

Incorporating MoE layers into neural networks, especially in the domain of language models, offers a path toward scaling models to sizes previously infeasible due to computational constraints. The conditional computation enabled by MoE layers allows for a more efficient distribution of computational resources, making it possible to train larger, more capable models. As we continue to demand more from our AI systems, architectures like the MoE-equipped Transformer are likely to become the standard for handling complex, large-scale tasks across various domains.