Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts

Recent advances in the architecture and performance of Multimodal Large Language Models or MLLMs have highlighted the importance of scaling data and models to improve performance. Although this approach does improve performance, it incurs substantial computational costs that limit its practicality and usability. Over the years, Mixture of Expert or MoE models have emerged as a successful alternative for scaling image-text and large language models efficiently, since they offer significantly lower computational costs along with strong performance. However, despite these advantages, existing Mixture of Expert models are not yet an ideal way to scale multimodal large language models, since they typically involve few experts and limited modalities, which restricts their applications.

To counter the roadblocks encountered by current approaches, and to scale large language models efficiently, in this article we will talk about Uni-MoE, a unified multimodal large language model with a MoE or Mixture of Expert architecture that is capable of handling a wide array of modalities and experts. The Uni-MoE framework implements a sparse Mixture of Expert architecture within the large language model, and makes training and inference more efficient by employing expert-level model parallelism and data parallelism. Furthermore, to enhance generalization and multi-expert collaboration, the Uni-MoE framework presents a progressive training strategy made up of three stages. First, the framework achieves cross-modality alignment using various connectors trained on different cross-modality data. Second, it activates the preference of the expert components by training modality-specific experts with cross-modality instruction data. Finally, it applies the LoRA or Low-Rank Adaptation learning technique on mixed multimodal instruction data to tune the model. When the instruction-tuned Uni-MoE framework was evaluated on a comprehensive set of multimodal datasets, the experimental results highlighted its principal advantage: a significant reduction in performance bias when handling mixed multimodal datasets. The results also indicated a significant improvement in multi-expert collaboration and generalization.

This article covers the Uni-MoE framework in depth, exploring its mechanism, methodology, and architecture, along with how it compares with state-of-the-art frameworks. So let’s get started.

The advent of open-source multimodal large language models including LLaMA and InstructBLIP has underlined the notable success and advancement in tasks involving image-text understanding over the past few years. Furthermore, the AI community is working actively towards building a unified multimodal large language model that could accommodate a wide array of modalities including image, text, audio, video, and more, moving beyond the traditional image-text paradigm. A common approach followed by the open source community to boost the abilities of multimodal large language models is to increase the size of vision foundation models, integrate them with large language models that have billions of parameters, and use diverse multimodal datasets to enhance instruction tuning. These developments have highlighted the increasing ability of multimodal large language models to reason over and process multiple modalities, showcasing the importance of expanding multimodal instruction data and model scale.

Although scaling up a model is a tried and tested approach that delivers substantial results, it is computationally expensive for both training and inference.

To counter the issue of high computational overhead, the open source community is moving towards integrating the MoE or Mixture of Expert model architecture into large language models to enhance both training and inference efficiency. In contrast to dense large language and multimodal large language models, which employ all available parameters to process each input, the Mixture of Expert architecture activates only a subset of expert parameters for each input. As a result, the Mixture of Expert approach emerges as a viable route to enhance the efficiency of large models without extensive parameter activation and high computational overhead. Although existing works have highlighted the successful implementation and integration of Mixture of Expert models in the construction of text-only and text-image large models, researchers are yet to fully explore the potential of the Mixture of Expert architecture for constructing powerful unified multimodal large language models.
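To make the dense versus sparse contrast concrete, the following sketch counts how many feed-forward weights a single token touches in one Mixture of Expert layer. It is a back-of-the-envelope illustration, not the Uni-MoE configuration: the function names, the simplified two-matrix expert, and the sizes (4 experts, 2 active per token) are all assumptions chosen for the example.

```python
def ffn_params(d_model, d_hidden):
    """Weight count of one simplified two-matrix feed-forward expert (biases ignored)."""
    return 2 * d_model * d_hidden


def moe_layer_params(d_model, d_hidden, num_experts, top_k):
    """Return (total, active_per_token) weight counts for a single MoE layer."""
    expert = ffn_params(d_model, d_hidden)
    router = d_model * num_experts  # one linear layer that scores every expert
    total = num_experts * expert + router
    active = top_k * expert + router
    return total, active


# Illustrative sizes only; they are not taken from the Uni-MoE configuration.
total, active = moe_layer_params(d_model=4096, d_hidden=11008, num_experts=4, top_k=2)
print(f"total={total:,}  active per token={active:,}  ratio={active / total:.2f}")
```

Under these assumed sizes, the layer stores four experts but each token only pays for two of them plus the small router, which is the source of the efficiency gain described above.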

Uni-MoE is a multimodal large language model that leverages sparse Mixture of Expert models to interpret and manage multiple modalities, exploring how to scale unified multimodal large language models with the MoE architecture. As demonstrated in the following image, the Uni-MoE framework first obtains the encoding of different modalities using modality-specific encoders, and then maps these encodings into the language representation space of the large language model using purpose-built connectors. These connectors contain a trainable transformer model followed by linear projections that distill and project the output representations of the frozen encoders. The Uni-MoE framework then introduces sparse Mixture of Expert layers within the internal blocks of the dense large language model. As a result, each Mixture of Expert based block features a self-attention layer shared across all modalities, a sparse router that allocates expertise at the token level, and diverse experts based on the feedforward network. Owing to this approach, the Uni-MoE framework can understand multiple modalities including speech, audio, text, video, and image, while activating only part of its parameters during inference.
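The token-level routing described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions rather than the Uni-MoE implementation: the helper names are hypothetical, the experts are tiny ReLU feed-forward networks, two experts are selected per token, and the gate weights come from a softmax over the selected scores, which is one common convention that the article does not confirm.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def matvec(mat, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in mat]


def make_expert(d_model, d_hidden, rng):
    """Build a small feed-forward expert with random weights (hypothetical helper)."""
    w1 = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    w2 = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]

    def expert(x):
        hidden = [max(0.0, a) for a in matvec(w1, x)]  # ReLU activation
        return matvec(w2, hidden)

    return expert


def moe_forward(token, router_w, experts, top_k=2):
    """Route one token to its top_k experts and mix their outputs."""
    logits = matvec(router_w, token)  # one score per expert
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    gates = softmax([logits[i] for i in top])  # assumed: renormalise over chosen experts
    out = [0.0] * len(token)
    for gate, idx in zip(gates, top):
        y = experts[idx](token)  # only the selected experts are evaluated
        out = [o + gate * v for o, v in zip(out, y)]
    return out, top, gates


rng = random.Random(0)
d_model, d_hidden, num_experts = 8, 16, 4
experts = [make_expert(d_model, d_hidden, rng) for _ in range(num_experts)]
router_w = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(num_experts)]
token = [rng.gauss(0, 1) for _ in range(d_model)]
out, chosen, gates = moe_forward(token, router_w, experts)
print("experts used:", chosen)
```

Because the unselected experts are never called, the cost per token depends on `top_k` rather than on the total number of experts.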

Furthermore, to enhance multi-expert collaboration and generalization, the Uni-MoE framework implements a three-stage training strategy. In the first stage, the framework uses extensive image/audio/speech to language pairs to train the corresponding connectors, yielding a unified modality representation in the language space of the large language model. Second, the Uni-MoE model trains modality-specific experts separately on cross-modality datasets in an attempt to refine the proficiency of each expert within its respective domain. In the third stage, the Uni-MoE framework integrates these trained experts into the Mixture of Expert layers of the large language model, and trains the entire Uni-MoE framework with mixed multimodal instruction data. To reduce the training cost further, the Uni-MoE framework employs the LoRA learning approach to fine-tune the self-attention layers and the pre-tuned experts.
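To show why LoRA reduces training cost, here is a minimal sketch of a low-rank adapted linear layer in plain Python. The class name, rank, and scaling value are assumptions for illustration and are not taken from the Uni-MoE setup. The frozen weight matrix is left untouched, and only two small matrices are trained; the second one starts at zero, so the adapted layer initially behaves exactly like the original.

```python
import random


def matvec(mat, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in mat]


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha / r) * B (A x)."""

    def __init__(self, weight, rank, alpha, rng):
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen pre-trained weights
        self.a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(d_out)]  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        delta = matvec(self.b, matvec(self.a, x))
        return [y + self.scale * d for y, d in zip(base, delta)]

    def trainable_params(self):
        return len(self.a) * len(self.a[0]) + len(self.b) * len(self.b[0])


rng = random.Random(0)
d = 64
weight = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(d)]
layer = LoRALinear(weight, rank=4, alpha=8, rng=rng)
x = [rng.gauss(0, 1) for _ in range(d)]
y = layer.forward(x)
print("trainable:", layer.trainable_params(), "of", d * d, "frozen weights")
```

With these illustrative sizes, the adapter trains 512 values instead of the 4,096 in the full matrix, and the saving grows as the layer gets larger.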

Uni-MoE : Methodology and Architecture

The basic motivation behind the Uni-MoE framework is the high training and inference cost of scaling multimodal large language models, set against the efficiency of Mixture of Expert models. The framework therefore explores the possibility of creating an efficient, powerful, and unified multimodal large language model using the MoE architecture. The following figure presents the architecture implemented in the Uni-MoE framework, with a design that includes individual encoders for different modalities, i.e. audio, speech, and visuals, along with their respective modality connectors.

The Uni-MoE framework then integrates the Mixture of Expert architecture with the core large language model blocks, a process crucial for boosting the overall efficiency of both training and inference. The Uni-MoE framework achieves this by implementing a sparse routing mechanism. The overall training process of the Uni-MoE framework can be split into three phases: cross-modality alignment, training modality-specific experts, and tuning Uni-MoE using a diverse set of multimodal instruction datasets. To efficiently transform diverse modal inputs into a linguistic format, the Uni-MoE framework is built on top of LLaVA, a pre-trained visual language framework. The LLaVA base model integrates CLIP as its visual encoder alongside a linear projection layer that converts image features into their corresponding soft image tokens. Furthermore, to process video content, the Uni-MoE framework selects eight representative frames from each video, and transforms them into video tokens by average pooling to aggregate their frame-based representations. For audio tasks, the Uni-MoE framework deploys two encoders, BEATs and the Whisper encoder, to enhance feature extraction. The model then distills fixed-length speech and audio feature vectors, and maps them into soft speech and audio tokens respectively via a linear projection layer.
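The video handling described above, eight frames aggregated by average pooling, can be sketched as follows. The article does not say how the representative frames are chosen, so evenly spaced sampling is an assumption here, and both function names are hypothetical. Each frame is represented by a plain list of numbers standing in for the features produced by the visual encoder.

```python
def sample_frame_indices(num_frames, num_samples=8):
    """Pick evenly spaced frame indices (assumed strategy; the source does not specify one)."""
    if num_frames <= 0:
        raise ValueError("video must contain at least one frame")
    if num_frames <= num_samples:
        return list(range(num_frames))
    step = num_frames / num_samples
    # Take the midpoint of each of the num_samples equal segments.
    return [int(step * i + step / 2) for i in range(num_samples)]


def average_pool(frame_features):
    """Average per-frame feature vectors into a single vector of the same width."""
    count = len(frame_features)
    width = len(frame_features[0])
    return [sum(frame[j] for frame in frame_features) / count for j in range(width)]


indices = sample_frame_indices(num_frames=80)
# Stand-in features: frame i is encoded as [i, 2 * i]; a real system would use encoder outputs.
features = [[float(i), 2.0 * i] for i in indices]
pooled = average_pool(features)
print(indices, pooled)
```

Pooling keeps the number of video tokens independent of the clip length, at the cost of discarding the ordering of the sampled frames.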

Training Strategy

The Uni-MoE framework introduces a progressive training strategy for the incremental development of the model. This strategy attempts to harness the distinct capabilities of the various experts, enhance the efficiency of multi-expert collaboration, and boost the overall generalizability of the framework. The training process is split into three stages, with the aim of realizing an MLLM structure built on top of integrated Mixture of Experts.

Stage 1: Cross Modality Alignment

In the first stage, the Uni-MoE framework attempts to establish connectivity between the different modalities and language. The Uni-MoE framework achieves this by constructing connectors that translate modal data into soft tokens. The primary objective of the first training stage is to minimize the generative entropy loss. Within the Uni-MoE framework, the LLM is optimized to generate descriptions for inputs across different modalities, and only the connectors are subjected to training, a strategy that enables the Uni-MoE framework to integrate different modalities within a unified language framework.
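The article does not spell out the generative entropy loss. Assuming it refers to the usual next-token cross-entropy used to train language models, it can be sketched as the average negative log-probability that the model assigns to each token of the target description. The function name and toy vocabulary of four tokens are illustrative.

```python
import math


def next_token_loss(logits_per_step, target_ids):
    """Mean negative log-likelihood of the target tokens under softmax(logits)."""
    total = 0.0
    for logits, target in zip(logits_per_step, target_ids):
        m = max(logits)
        # Log of the softmax normaliser, computed stably.
        log_z = m + math.log(sum(math.exp(v - m) for v in logits))
        total += log_z - logits[target]
    return total / len(target_ids)


# A model with no preference among 4 tokens versus one that favours the right token.
uniform = next_token_loss([[0.0, 0.0, 0.0, 0.0]] * 2, [1, 3])
confident = next_token_loss([[0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 0.0, 5.0]], [1, 3])
print(f"uniform={uniform:.3f} confident={confident:.3f}")
```

In Stage 1 the gradient of such a loss would update only the connector parameters, since the article states that the connectors are the only components trained in this stage.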

Stage 2: Training Modality Specific Experts

In the second stage, the Uni-MoE framework focuses on developing single-modality experts by training the model on dedicated cross-modality data. The primary objective is to refine the proficiency of each expert within its respective domain, thus enhancing the overall performance of the Mixture of Expert system on a wide array of multimodal data. Furthermore, the Uni-MoE framework tailors the feedforward networks to align more closely with the characteristics of each modality, while keeping the generative entropy loss as the focal training metric.

Stage 3: Tuning Uni-MoE

In the third and final stage, the Uni-MoE framework integrates the expert weights tuned during Stage 2 into the Mixture of Expert layers. The Uni-MoE framework then fine-tunes the MLLM jointly on mixed multimodal instruction data. The loss curves in the following image reflect the progress of the training process.

A comparative analysis of the different Mixture of Expert configurations revealed that the experts refined during the second training stage displayed enhanced stability and achieved quicker convergence on mixed-modal datasets. Furthermore, on tasks that involved complex multimodal data including text, images, audio, and videos, the Uni-MoE framework demonstrated more consistent training performance and reduced loss variability when it employed four experts than when it employed two experts.
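The integration step that opens Stage 3 can be pictured as copying each separately trained expert's feed-forward weights into the matching slot of every Mixture of Expert layer. The sketch below is only a schematic of that bookkeeping: the function name, the expert names, and the checkpoint layout are hypothetical, and small lists stand in for real weight tensors.

```python
import copy


def build_moe_layers(expert_checkpoints, num_layers):
    """Assemble per-layer expert banks from separately trained feed-forward weights.

    expert_checkpoints maps a hypothetical expert name to a list that holds
    that expert's feed-forward weights for each transformer layer.
    """
    moe_layers = []
    for layer_idx in range(num_layers):
        bank = {
            name: copy.deepcopy(layers[layer_idx])
            for name, layers in expert_checkpoints.items()
        }
        moe_layers.append(bank)
    return moe_layers


# Toy checkpoints: each "weight" is a short placeholder list, not a real tensor.
checkpoints = {
    "image_expert": [[0.1, 0.2], [0.3, 0.4]],
    "audio_expert": [[1.1, 1.2], [1.3, 1.4]],
}
moe = build_moe_layers(checkpoints, num_layers=2)
print(sorted(moe[0]))
```

Copying the weights, rather than sharing references, means that later joint tuning of the assembled layers leaves the original Stage 2 checkpoints unchanged.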

Uni-MoE : Experiments and Results

The following table summarizes the architectural specifications of the Uni-MoE framework. The primary goal of the Uni-MoE framework, built on the LLaMA-7B architecture, is to scale the model size.

The following table summarizes the design and optimization of the Uni-MoE framework as guided by specialized training tasks. These tasks are instrumental in refining the capabilities of the MLP layers, thereby leveraging their specialized knowledge for enhanced model performance. The Uni-MoE framework undertakes eight single-modality expert tasks to elucidate the differential impacts of various training methodologies. 

The performance of the various model variants is evaluated across a diverse set of benchmarks that encompasses two video-understanding, three audio-understanding, and five speech-related tasks. First, the model is tested on its ability to understand speech-image and speech-text tasks, and the results are contained in the following table.

As can be observed, the previous baseline models deliver inferior results across speech understanding tasks, which further impacts their performance on image-speech reasoning tasks. The results indicate that introducing the Mixture of Expert architecture can enhance the generalizability of MLLMs on unseen audio-image reasoning tasks. The following table presents the experimental results on image-text understanding tasks. As can be observed, the best results from the Uni-MoE models outperform the baselines, and surpass the fine-tuning task by an average margin of 4 points.

Final Thoughts

In this article we have talked about Uni-MoE, a unified multimodal large language model with a MoE or Mixture of Expert architecture that is capable of handling a wide array of modalities and experts. The Uni-MoE framework implements a sparse Mixture of Expert architecture within the large language model, and makes training and inference more efficient by employing expert-level model parallelism and data parallelism. Furthermore, to enhance generalization and multi-expert collaboration, the Uni-MoE framework presents a progressive training strategy made up of three stages. First, the framework achieves cross-modality alignment using various connectors trained on different cross-modality data. Second, it activates the preference of the expert components by training modality-specific experts with cross-modality instruction data. Finally, it applies the LoRA or Low-Rank Adaptation learning technique on mixed multimodal instruction data to tune the model.