MambaOut: Do We Really Need Mamba for Vision?

In modern machine learning and artificial intelligence frameworks, transformers are one of the most widely used components across various domains including GPT series, and BERT in Natural Language Processing, and Vision Transformers in computer vision tasks. Although including transformers in the model architecture gives a significant boost in the model performance, the attention module in Transformers scales with the sequence length quadratically, leading to high computational challenges. Over the years, different models have explored different strategies to tackle the computational challenges including methods like kernelization, history memory compression, token mixing range limitation, and low-rank approaches. Recently, Recurrent Neural Networks like methods including Mamba and RWKV have gathered significant attention owing to their promising results in large language models. 

Mamba, a family of models has an architecture with a Recurrent Neural Network like token mixer of a state space model was recently introduced to address the quadratic complexity of the attention mechanisms and was applied to vision tasks subsequently.  Researchers have already explored ways to incorporate Mamba and SSM or State Space Model into visual recognition tasks, and Vision Mamba that incorporates Mamba to develop isotropic vision models akin to Vision Transformer is a great example of the same. On the other hand, LocalMamba incorporates local inductive biases to enhance visual Mamba models, and VMamba framework employs the base Mamba model to construct hierarchical models similar to ResNet and AlexNet. However, is the Mamba framework really essential for visual recognition context tasks? The question arises because the performance of the Mamba family of models for vision tasks has been underwhelming so far when compared against traditional attention-based and convolutional models. 

MambaOut is a work that attempts to delve into the essence of the Mamba framework, and answer whether Mamba is ideally suited for tasks with autoregressive and long-sequence characteristics. The MambaOut framework hypothesizes that Mamba is not necessary for vision tasks since image classification does not align either with long-sequence or autoregressive characteristics. Although segmentation and detection tasks are also not autoregressive, they display long-sequence characteristics, leading the MambaOut framework to hypothesize the potential of Mamba for these tasks. The MambaOut framework is constructed by stacking Mamba blocks on top of one another while removing the state space model, its core token mixer. The experimental results support the hypothesis put forward by the MambaOut framework since it is able to surpass all the visual Mamba models on the ImageNet image classification framework, indicating the Mamba is not necessary for vision tasks. On the other hand for detection and segmentation tasks, MambaOut framework is unable to replicate the performance offered by state of the art Mamba model, demonstrating the potential of the Mamba family of models for long-sequence visual tasks. 

This article aims to cover the MambaOut framework in depth, and we explore the mechanism, the methodology, the architecture of the framework along with its comparison with state of the art frameworks. So let’s get started. 

With the progress of machine learning applications and capabilities, Transformers have emerged as the mainstream backbone for a range of tasks, powering prominent models including Vision Transformers, GPT series of models, BERT, and a few more. However, the token mixer of the transformer incurs a quadratic complexity with respect to the sequence length, and poses significant challenges for longer sequences. To address this issue, numerous token mixers with linear complexity to token length like Linformer, Longformer, Performer, Dynamic Convolution, and Big Bird have been introduced. However, in recent times, Recurrent Neural Network like models are gaining prominence owing to their capability of parallelizable training, and delivering efficient performance on longer sequences. Guided by the remarkable performance offered by RNN-like models, researchers are attempting to introduce and utilize the Mamba family of models into visual recognition tasks since the token mixer of the Mamba models is the structured state space model under the spirit of the Recurrent Neural Networks. However, experimental results indicate that state space model based frameworks for vision perform underwhelmingly across real-world vision tasks when compared against attention-based, and state of the art convolutional models. 

MambaOut is an attempt to investigate the nature of the Mamba family of models, and summarizes that Mamba is suited for tasks that are either autoregressive or of long-sequence since the state space model has an inherent RNN mechanism. However, a majority of vision tasks do not feature both of these characteristics, and on the basis of some experiments, MambaOut proposes the following two hypotheses. First, the state space model is not necessary for image classification since the image classification task conforms neither to autoregressive nor long-sequence characteristics. Second, state space models may be hypothetically beneficial for instance segmentation and semantic segmentation along with object detection, since they follow the long-sequence characteristics although they are not autoregressive. Experimental results conducted to analyze the Recurrent Neural Network like mechanism of state space model conclude that the Mamba framework is suited for tasks with autoregressive or long-sequence characteristics, and is unnecessary for image classification tasks. Coming to the MambaOut framework itself, it is a series of Mamba models based on Gated Convolutional Neural Network blocks without the state space model, and experimental results indicate that the MambaOut framework is capable of outperforming Mamba models in image classification tasks, but fails to replicate the performance on image detection and segmentation tasks. 

What tasks is Mamba suitable for?

The token mixer of the Mamba framework is a selective state space model that defines four input-dependent parameters. The recurrent property of the framework distinguishes RNN-like state space models from causal attention. The hidden state can be seen as a fixed-size memory that stores historical information. The fixed size means that the memory is lossy, but it also ensures the computational complexity of integrating memory with the current input remains constant. Conversely, causal attention layers store all keys and values from previous tokens, and expands by adding the key and value of the current token with each new input, and this memory is lossless, theoretically. However, the memory size grows as more tokens are inputted, increasing the complexity of integrating the memory with the current input. The difference between the memory mechanisms between causal attention and RNN-like models are illustrated in the following figure. 

MambaOut: Do We Really Need Mamba for Vision?

Since the memory of the state space model is inherently lossy, it falls short of the lossless memory of causal attention, and as a result, the Mamba models cannot demonstrate its strength in handling short sequences, an area where causal attention mechanism performs well with ease. However, in scenarios that involve long sequences, the causal attention approach falters due to the quadratic complexity. In this scenario, the Mamba framework showcases its efficiency in merging memory with the current input, and is able to handle long sequences smoothly, indicating the Mamba family of models is well-suited for processing long sequences. 

It is also worth noting that on one hand where the recurrent nature of the state space model allows the Mamba models to efficiently handle long sequences, it introduces a certain limitation as it can access information only from the current and previous timesteps, and this type of token mixing is termed causal mode, and illustrated in the following figure. Due to its causal nature, this method is suited for autoregressive generation tasks

The fully-visible mode is suitable for understanding tasks where the model can access all the inputs at once. Furthermore, attention is in fully-visible mode by default, and it can be turned into causal mode easily by applying causal masks to the attention maps, and RNN-like models operate inherently in causal mode due to their recurrent properties. To summarize things, the Mamba framework is suited for tasks that either involve processing long sequences, or tasks that require causal token mixing mode.

Visual Recognition Tasks, Causal Token Mixing Code, and Very Large Sequences

As discussed earlier, the fully-visible token mixing mode allows unrestricted range of mixing whereas the causal mode limits the current token to access only the information from the preceding tokens. Furthermore, visual recognition is categorized as an understanding task where the model can see the entire image at once, and this eliminates the need for restrictions on token mixing, and imposing additional constraints on token mixing can degrade the model performance potentially. Generally, the fully-visible mode is appropriate for understanding tasks whereas the casual mode suits autoregressive tasks better. Furthermore, this claim is supported further by the fact that BeRT and ViT models are used for understanding tasks more than GPT models.

Experimental Verification and Results

The next step is to verify the hypotheses proposed by the MambaOut framework experimentally. As demonstrated in the following image, the Mamba block is based on the Gated Convolutional Neural Network block, and the meta-architecture of the Mamba and Gated CNN blocks can be treated as a simplified integration of the token mixer of MetaFormer framework, and an MLP. 

The Mamba block extends the Gated Convolutional Neural Network with an additional State Space Model, and the presence of an SSm is what distinguishes the Gated CNN and the Mamba block. Furthermore, to improve the practical speed, the MambaOut framework conducts only depthwise convolution on partial channels, and as demonstrated in the following algorithm, the implementation of the Gated CNN block is simple, yeet effective and elegant. 

Image Classification Task

ImageNet serves as the benchmark for image classification tasks as it consists of over a thousand common classes, over 1.3 million training images, and over 50,000 validation images. The data augmentation used for the experiment consists of random resized crop, Mixup, color jitter, Random Erasing, CutMix, and Rand Augment. The following table summarizes the performance of the Mamba family of models, MambaOut model, and other attention-based & convolution models on the ImageNet dataset. As it can be seen, the MambaOut framework without the state space model outperforms visual Mamba models with SSM consistently across all model sizes. 

For example, the MambaOut-Small model returns a top-1 accuracy score of over 84%, 0.4% higher than its nearest Mamba competitor. This result strongly supports the first hypothesis that claims that introducing a state space model for image classification tasks is not needed. 

Object Detection and Instance Segmentation Tasks

COCO serves as a benchmark for object detection and instance segmentation tasks. Although the MambaOut framework is able to surpass the performance of some visual Mamba models, it still falls short of state of the art visual Mamba models including LocalVMamba and VMamba. The disparity in performance of MambaOut against state of the art visual models emphasizes on the benefits of integrating the Mamba family of models in long-sequence visual tasks. However, it is worth noting that a significant performance gap still exists between state of the art convolution-attention-hybrid models and visual Mamba models. 

Final Thoughts

In this article, we have discussed the concepts of the Mamba family of models, and concluded that it is suited for tasks involving autoregressive and long-sequence characteristics. MambaOut is a work that attempts to delve into the essence of the Mamba framework, and answer whether Mamba is ideally suited for tasks with autoregressive and long-sequence characteristics. The MambaOut framework hypothesizes that Mamba is not necessary for vision tasks since image classification does not align either with long-sequence or autoregressive characteristics. Although segmentation and detection tasks are also not autoregressive, they display long-sequence characteristics, leading the MambaOut framework to hypothesize the potential of Mamba for these tasks. The MambaOut framework is constructed by stacking Mamba blocks on top of one another while removing the state space model, its core token mixer. The experimental results support the hypothesis put forward by the MambaOut framework since it is able to surpass all the visual Mamba models on the ImageNet image classification framework, indicating the Mamba is not necessary for vision tasks. On the other hand for detection and segmentation tasks, MambaOut framework is unable to replicate the performance offered by state of the art Mamba model, demonstrating the potential of the Mamba family of models for long-sequence visual tasks.