A summary of our series about the most viable alternative to transformers.
💡 ML Concept of the Day: A Summary of Our Series About State Space Models
In the last few weeks, The Sequence has covered the fundamental concepts and research behind state space models (SSMs). Today, we would like to present a summary of that series, which explored one of the most interesting trends in foundation models. This marks the end of the series. Next week we start a new, equally deep technical series, but you will need to read until the end to find out the details.
What makes SSMs so interesting is that they are widely considered the most viable alternative to transformers.
While transformers are, by far, the most important architecture for foundation models, they don’t come without limitations. The main one is inference: the entire sequence must be passed to the model every time a new output is generated. This poses major scalability limitations for long-context tasks.
Previous architectures such as recurrent neural networks (RNNs) address some of these limitations, but they tend to forget information in long sequences and are hard to parallelize.
SSMs excel due to their recurrent properties, which allow the model to process only the latest input while retaining information from previous inputs. This efficiency stems from their mathematical design, making both training and inference computationally efficient compared to older recurrent models such as RNNs. SSM-based architectures have demonstrated superior performance over transformers in tasks requiring long-context understanding, as evidenced by benchmarks like the Long Range Arena (LRA). Newer models, such as Mamba, outperform state-of-the-art transformers in both quality and computational efficiency on these tasks. These findings suggest that SSMs could address many of the limitations currently associated with transformers, although most research so far has concentrated on developing high-performing architectures and efficient implementations.
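As a quick refresher on where that recurrence comes from, here is the standard linear state space formulation that these models build on, written in the common S4/Mamba-style notation (the feed-through term is omitted for brevity):

$$
x'(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t)
$$

which, after discretization with a step size $\Delta$, becomes the recurrence

$$
x_k = \bar{A}\,x_{k-1} + \bar{B}\,u_k, \qquad y_k = C\,x_k.
$$

Generating $y_k$ only requires the previous state $x_{k-1}$ and the current input $u_k$, never the full history, which is exactly why per-token inference cost stays constant.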
In general, SSMs bring some key capabilities that are relevant in the context of foundation models:
- SSMs scale linearly with the context window instead of quadratically like transformers (see the toy sketch after this list).
- SSMs have virtually no limit on the context window, as they rely on a completely recursive view of the sequence.
- The behavior of SSMs is fundamentally captured in a single matrix learned from the data.
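To make those three points concrete, here is a minimal, hypothetical sketch of the discretized recurrence in Python. The matrices, dimensions, and function names are invented for illustration and do not correspond to any specific SSM implementation; the point is simply that the hidden state has a fixed size, every new token costs the same amount of work, and the dynamics are captured by the learned state matrix.

```python
import numpy as np

# Toy dimensions, chosen only for the illustration.
state_dim, input_dim = 16, 1

rng = np.random.default_rng(0)
# In a real SSM these matrices are learned; here they are random placeholders.
A_bar = rng.normal(scale=0.1, size=(state_dim, state_dim))  # discretized state matrix: captures the dynamics
B_bar = rng.normal(size=(state_dim, input_dim))             # input projection
C = rng.normal(size=(1, state_dim))                         # readout projection

def ssm_step(x_prev, u_t):
    """One recurrent step: constant cost per token, regardless of sequence length."""
    x_t = A_bar @ x_prev + B_bar @ u_t  # update the fixed-size hidden state
    y_t = C @ x_t                       # emit the output for this position
    return x_t, y_t

# Stream a long sequence one token at a time; memory stays O(state_dim),
# and total compute grows linearly with the number of tokens.
x = np.zeros((state_dim, 1))
for _ in range(100_000):
    u = rng.normal(size=(input_dim, 1))
    x, y = ssm_step(x, u)
```

A transformer, by contrast, has to attend over all previous tokens at every generation step, which is where the quadratic scaling with context length comes from.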
Throughout this series, we discussed some of the most interesting concepts, research and technology associated with SSMs. Here is a brief summary:
- Edge 421: Introduced our series about SSMs. Discussed the groundbreaking paper about the alignment of transformers and SSMs and reviewed the DeepCheck framework for monitoring and evaluating LLMs.
- Edge 423: Discussed the fundamental equation of SSMs. Reviewed the S4 model, one of the first SSMs to gain real adoption, and presented NVIDIA’s NIM framework for containerized deployment of AI models.
- Edge 425: Dived into Mamba, arguably the most popular SSM ever created. The issue reviewed the original Mamba paper and the GridTape framework for building LLM applications.
- Edge 427: Reviewed Jamba, a model that combines SSMs, transformers, and MoEs, including the original Jamba paper. It also provided an overview of the DeepEval framework for LLM evaluation.
- Edge 429: Discussed the idea of tokenization-free SSMs. Reviewed the MambaByte paper and the MindsDB platform for building AI systems.
- Edge 431: Reviewed Cobra, a multimodal SSM, including its original paper. It also introduced NVIDIA’s TensorRT-LLM for fast inference.
- Edge 433: Introduced SAMBA and the concept of SSMs for long-context windows. The issue reviewed the original SAMBA paper and Microsoft’s Task Weaver agent for analytic workloads.
- Edge 435: Dived into Hungry Hungry Hippos (H3), which has become an important layer in SSMs. The installment reviewed the H3 paper and Character.ai’s PromptPoet framework.
- Edge 437: Discussed the BlackMamba model, which combines SSMs and MoEs in a single architecture, including its original paper. It also reviewed the SWE-Agent for software engineering tasks.
- Edge 439: Reviewed Zamba, a model that combines SSMs and attention layers. We dived into Zamba’s original paper and reviewed the LitServe framework for high-performance model serving.
- Edge 441: Explored SSMs for non-language modalities. It reviewed Meta AI’s Multi-Head SSM for speech recognition and the Llama-Factory framework for pretraining LLMs.
I hope you enjoyed this series, even though it went super technical. Next week we start a new series about one of the hottest topics in foundation models: knowledge distillation!