MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

Recent advancements in Large Vision Language Models (LVLMs) have shown that scaling these frameworks significantly boosts performance across a variety of downstream tasks. LVLMs such as MiniGPT-4, LLaVA, and others have achieved remarkable capabilities by incorporating a visual projection layer and an image encoder into their architecture, enhancing the visual perception capabilities of the underlying Large Language Models (LLMs). Performance can be improved further by increasing the model's size and number of parameters, as well as by expanding the dataset scale.

Models like InternVL have expanded their image encoder to over 6 billion parameters, while others have extended the LLM backbone of LVLMs to 13 billion parameters, achieving superior performance on a wide array of tasks. IDEFICS has trained an LVLM with over 80 billion parameters. These scaling methods have matched or exceeded the performance of LLMs with over 34, 70, or even 100 billion parameters. However, scaling has a downside: it significantly increases training and inference costs, because dense models require every parameter to be active for each token, leading to high computational demands and, consequently, higher costs.

This article discusses MoE-LLaVA, a Mixture of Experts (MoE)-based sparse LVLM architecture that employs an effective training strategy, MoE-Tuning, for LVLMs. MoE-Tuning innovatively addresses performance degradation in multi-modal sparsity learning, resulting in a model with a large number of parameters but consistent training and inference costs. The MoE-LLaVA architecture is designed to activate only the top-k experts during deployment, keeping the rest inactive.

We aim to thoroughly explore the MoE-LLaVA framework, examining its mechanism, methodology, architecture, and how it compares with leading image and video generation frameworks. Let’s delve into the details.

In addition to leveraging visual projection layers and image encoders, Large Vision Language Models also scale up model size by increasing the number of parameters to enhance performance. Notable examples that have followed this approach include MiniGPT-4, InternGPT, InternVL, and others. In real-world applications, scaling an LLM or an LVLM with high-quality training data often becomes a necessity to improve performance. Although scaling up a model does improve performance, it also increases the computational cost of training and deploying the model, and further complicates efficient deployment across parallel devices. A major reason behind the increased training and inference costs and computational requirements is that each token demands computation with every single parameter of the model, a setup known as a dense model. 

On the other hand, sparse MoE or Mixture of Experts models have demonstrated effective scaling by processing data with a fixed number of activated parameters, an approach that has been widely adopted in the Natural Language Processing field. However, using Mixture of Experts to train sparse Large Vision Language Models directly is challenging, since converting an LLM into an LVLM and sparsifying the model simultaneously results in significant performance degradation. To apply Mixture of Experts to scale LLMs and LVLMs, it is essential to first initialize the LVLM for sparsification. To achieve this, the MoE-LLaVA framework introduces MoE-Tuning, a simple yet effective three-phase training strategy. 
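To make the dense-versus-sparse distinction concrete, here is a back-of-the-envelope parameter count (the symbols and numbers below are purely illustrative and are not MoE-LLaVA's actual configuration). With shared, always-active parameters $N_{\text{shared}}$, per-expert FFN parameters $N_{\text{FFN}}$, $E$ experts, and top-$k$ routing:

$$
N_{\text{total}} = N_{\text{shared}} + E \cdot N_{\text{FFN}},
\qquad
N_{\text{active}} = N_{\text{shared}} + k \cdot N_{\text{FFN}}
$$

For instance, $N_{\text{shared}} = 1\text{B}$, $N_{\text{FFN}} = 0.5\text{B}$, $E = 4$, and $k = 2$ give a model that stores 3B parameters but activates only 2B per token, so capacity grows with the number of experts while per-token compute stays fixed.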

As shown in the above figure, the MoE-Tuning process first trains an MLP or Multilayer Perceptron that adapts the visual tokens to the Large Language Model in the first stage. The framework then trains all parameters of the LLM to pre-empower the Large Vision Language Model with general multi-modal understanding capabilities. Finally, in the third stage, the framework replicates the FFN or Feed Forward Network as the initialization weights for the experts, and trains only the Mixture of Experts layers. Overall, the training process enables a gradual transition from a dense LVLM initialization to a sparse mixture of experts model. 

With the training process covered, let us shine some light on MoE-LLaVA, a baseline for Large Vision Language Models with Mixture of Experts that incorporates learnable routers and MoE layers. At its core, the MoE-LLaVA model consists of multiple sparse paths, and the framework uses these paths to dispatch each token to different experts through the learnable router. The tokens are then processed collectively by the activated experts while the inactive paths remain silent. The framework then stacks these Mixture of Experts encoder layers iteratively to provide a sparse path towards a larger and more powerful LVLM. 

Thanks to this approach, the MoE-LLaVA framework is able to outperform models with a similar number of activated parameters and surpass them by a clear margin on the POPE object hallucination benchmark, despite having only 2.2 billion parameters. Furthermore, the MoE-LLaVA framework with 2.2 billion parameters achieves performance comparable to the InternVL-Chat-19B framework, which has nearly 8 times as many activated parameters. 

Furthermore, powerful Large Language Models with strong generalization and instruction-following capabilities have been incorporated into Large Vision Language Models. Early works like BLIP encoded visual signals into a sequence of visual tokens, allowing them to successfully adapt vision to LLMs using multiple projection layers. More recent works focus on improving model performance through methods such as expanding the instruction-tuning dataset, increasing image resolution, optimizing training strategies, aligning the inputs, enhancing the image encoders, and more. These approaches have empowered LVLMs with powerful visual understanding capabilities by expanding the visual instruction fine-tuning datasets and model scales. Furthermore, some LVLMs also possess fine-grained image understanding capabilities such as region and multi-region understanding along with pixel-wise grounding. However, the computational cost that accompanies scaling up dense visual data and models is often prohibitively high, which makes further scaling challenging to bear. The MoE-LLaVA framework, in contrast, aims to make LVLM research more affordable by leveraging the capabilities of MoE models. 

MoE-LLaVA : Method and Architecture

At its core, the MoE-LLaVA framework consists of a visual projection layer (Multilayer Perceptron), a vision encoder, MoE blocks, multiple stacked LLM blocks, and a word embedding layer. 

Architecture

The following table summarizes the detailed configurations of the MoE-LLaVA framework. 

For a given RGB image, the vision encoder processes the image to obtain a sequence of visual tokens, and the visual projection layer maps this visual token sequence into the input space of the LLM. The text input is processed by the word embedding layer, which projects it to obtain the sequence of text tokens. The MoE-LLaVA framework then concatenates the visual and text tokens together and feeds them to the LLM. However, the framework trains only the visual projection layer and the large language model, which consists of FFNs or Feedforward Neural Networks and Multi-Head Self-Attention layers. Finally, the framework applies residual connections and layer normalization to each block. 
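To make the data flow above concrete, here is a minimal PyTorch sketch of the input pipeline; the class name, dimensions, and the passed-in vision encoder are assumptions for illustration rather than MoE-LLaVA's actual implementation:

```python
import torch
import torch.nn as nn


class MultimodalInputPipeline(nn.Module):
    """Sketch of the input flow described above: visual tokens from the vision
    encoder are projected by an MLP and concatenated with embedded text tokens."""

    def __init__(self, vision_encoder: nn.Module, vision_dim: int = 1024,
                 llm_dim: int = 2048, vocab_size: int = 32000):
        super().__init__()
        self.vision_encoder = vision_encoder            # e.g. a CLIP-style backbone
        self.projector = nn.Sequential(                 # visual projection layer (MLP)
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.word_embedding = nn.Embedding(vocab_size, llm_dim)

    def forward(self, images: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
        visual_tokens = self.projector(self.vision_encoder(images))  # (B, N_img, llm_dim)
        text_tokens = self.word_embedding(text_ids)                  # (B, N_txt, llm_dim)
        # Concatenate image and text tokens into one sequence for the stacked LLM blocks.
        return torch.cat([visual_tokens, text_tokens], dim=1)
```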

Moving along, the MoE-LLaVA framework replicates the FFNs or Feedforward Neural Networks from the second stage to form an ensemble of experts as its initialization step. The router, a linear layer, predicts the probability of each token being assigned to each expert. Each token is then processed by the top-k experts with the highest probabilities, and the output is computed as a weighted sum based on the softmax of those probabilities. 
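Written out explicitly, with notation introduced here since the article gives only a verbal description, the routing step amounts to:

$$
\mathcal{P}(\mathbf{x})_i = \frac{e^{f(\mathbf{x})_i}}{\sum_{j=1}^{N} e^{f(\mathbf{x})_j}},
\qquad
\text{MoE}(\mathbf{x}) = \sum_{i \in \text{Top-}k} \mathcal{P}(\mathbf{x})_i \, E_i(\mathbf{x})
$$

where $f$ is the linear router, $N$ is the number of experts, $E_i$ denotes the $i$-th expert FFN, and Top-$k$ is the set of the $k$ experts with the highest routing probabilities for token $\mathbf{x}$.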

MoE-Tuning

MoE-Tuning is a simple yet effective three-phase training strategy. In the first stage, it trains an MLP or Multilayer Perceptron that adapts the visual tokens to the Large Language Model. In the second stage, it trains all parameters of the LLM to pre-empower the Large Vision Language Model with general multi-modal understanding capabilities. Finally, in the third stage, it replicates the FFN or Feed Forward Network as the initialization weights for the experts and trains only the Mixture of Experts layers. 
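The division of trainable components across the three stages can be summarized schematically as follows; the module names are illustrative rather than the identifiers used in the official repository:

```python
# Which components are updated in each MoE-Tuning stage, per the description above.
# "llm" covers the stacked self-attention + FFN blocks; names are illustrative.
MOE_TUNING_STAGES = {
    "stage_1": {"trainable": ["mlp_projector"],
                "frozen":    ["vision_encoder", "llm"]},
    "stage_2": {"trainable": ["llm"],                 # all LLM parameters; the text above does not
                "frozen":    ["vision_encoder"]},     # specify the projector's status in this stage
    "stage_3": {"trainable": ["moe_layers"],          # experts initialized from the stage-2 FFN weights
                "frozen":    ["vision_encoder", "mlp_projector", "non_moe_llm_blocks"]},
}
```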

Stage 1

In the first stage, the primary objective is to adapt the image tokens to the large language model so that the LLM can comprehend the instances in the image. The MoE-LLaVA framework employs a multilayer perceptron to project the image tokens into the input domain of the large language model, treating image patches as pseudo-text tokens. In this stage, the MoE-LLaVA framework trains the model to describe the images, and the MoE layers are not applied to the LLM.
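In code, the stage-1 setup reduces to a freeze pattern along these lines; this is a minimal sketch that assumes the model exposes a `projector` submodule, which is a hypothetical name:

```python
import torch.nn as nn


def configure_stage_1(model: nn.Module) -> nn.Module:
    """Freeze every parameter, then unfreeze only the visual projection MLP so
    that training updates just the adapter between image tokens and the LLM."""
    for param in model.parameters():
        param.requires_grad = False
    for param in model.projector.parameters():  # hypothetical submodule name
        param.requires_grad = True
    return model
```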

Stage 2

In the second stage, the MoE-LLaVA framework attempts to enhance the capabilities and controllability of the model by tuning it with multi-modal instruction data. The framework achieves this by adjusting the LLM to become an LVLM with multi-modal understanding capabilities, employing more complex instructions, including text recognition and logical image reasoning tasks, that require the model to possess stronger multi-modal capabilities. Traditionally, the training process for dense models is considered complete at this step. However, the MoE-LLaVA framework encountered challenges in transforming the LLM into an LVLM while simultaneously sparsifying the LVLM. To counter this challenge, the framework uses the weights from the second stage as the initialization for the third stage, in an attempt to alleviate the learning difficulty of the sparse model. 

Stage 3

In the third stage, the model replicates the feedforward neural network several times to initialize the experts. The framework then feeds the text and image tokens into the Mixture of Experts layers, where the router calculates the matching weights between each token and the experts. Each token is then processed by the top-k experts, and the aggregated output is calculated by a weighted summation based on the router's weights. Once the top-k experts are activated, the model shuts off the remaining experts, an approach that equips the MoE-LLaVA framework with a vast number of possible sparse paths and, in turn, a wide range of capabilities. 
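Putting the stage-3 mechanics together, a minimal sketch of such an MoE block might look like the following; it is simplified and token-looped for clarity, not the reference MoE-LLaVA implementation, and it omits efficiency optimizations such as batched expert dispatch:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEBlock(nn.Module):
    """Experts are initialized as copies of a trained FFN, a linear router scores
    each token, and only the top-k experts run, combined by a weighted sum."""

    def __init__(self, ffn: nn.Module, dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        # Replicate the stage-2 FFN to initialize every expert.
        self.experts = nn.ModuleList(copy.deepcopy(ffn) for _ in range(num_experts))
        self.router = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: (num_tokens, dim)
        probs = F.softmax(self.router(x), dim=-1)              # routing probabilities per expert
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for i, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == i                  # tokens routed to expert i in this slot
                if mask.any():                                 # experts that receive no tokens are never run
                    out[mask] += topk_probs[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


# Example: MoEBlock(ffn=nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512)), dim=512)
```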

MoE-LLaVA : Results and Experiments

The MoE-LLaVA framework adopts CLIP-Large as the vision encoder, with the Multilayer Perceptron consisting of two layers separated by a GELU activation. By default, the framework alternately replaces the feedforward neural networks with Mixture of Experts layers, meaning MoE layers comprise 50% of the total number of layers. The following table contains the different datasets, along with their sample sizes, used to train and evaluate the MoE-LLaVA framework. 
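The alternating placement can be sketched as a simple stack-building rule; the factory callables below are hypothetical placeholders rather than MoE-LLaVA's actual API:

```python
import torch.nn as nn


def build_block_stack(num_layers: int, make_dense_block, make_moe_block) -> nn.ModuleList:
    """Alternate dense and MoE transformer blocks so that MoE layers
    account for 50% of the stack, as described above."""
    return nn.ModuleList(
        make_moe_block() if idx % 2 else make_dense_block()
        for idx in range(num_layers)
    )
```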

Zero-Shot Image Question Answering

MoE-LLaVA is a sparse LVLM built with soft routers. The framework is evaluated on five image question answering benchmarks, and as the following figure shows, it demonstrates remarkable image understanding capabilities and delivers performance comparable to the state-of-the-art LLaVA 1.5 framework across all five benchmarks. 

Object Hallucination Evaluation

To evaluate object hallucination, the MoE-LLaVA framework adopts the POPE evaluation pipeline, a polling-based query method, and the results are shown in the following table. As can be observed, among all the frameworks, MoE-LLaVA delivers the strongest results, indicating its ability to generate objects consistent with the input image. Additionally, it is worth noting that the MoE-LLaVA framework balances the yes ratio well, indicating the capability of the sparse model to provide accurate feedback for the given question. 

The following image shows the distribution of expert loadings, where the discontinuous lines represent a well-balanced distribution of tokens among the modalities or experts. The first figure illustrates the workload within the experts, while the remaining images demonstrate the behavior of the experts towards different modalities. 

Furthermore, the following figure demonstrates the distribution of modalities across different experts. 

Final Thoughts

In this article, we have talked about MoE-LLaVA, a baseline for Large Vision Language Models with Mixture of Experts that incorporates learnable routers and MoE layers. At its core, the MoE-LLaVA model consists of multiple sparse paths, and the framework uses these paths to dispatch each token to different experts through the learnable router. The tokens are then processed collectively by the activated experts while the inactive paths remain silent. The framework then stacks these Mixture of Experts encoder layers iteratively to provide a sparse path towards a larger and more powerful LVLM. The MoE-Tuning strategy innovatively addresses the common issue of performance degradation in multi-modal sparsity learning, constructing a model with a significantly large number of parameters but consistent training and inference costs. The architecture of the MoE-LLaVA framework is designed so that only the top-k experts are activated during deployment, while the remaining experts stay inactive.