The Most Important Algorithm for Transformers

FlashAttention has a new version. Plus some important research milestones and major funding activity in AI.

Created Using Ideogram

Next Week in The Sequence:

  • Edge 413: Our series about autonomous agents continues with an exploration of semantic memory. We review Meta AI’s MM-LLM research to augment video models with memory, and we dive into the Qdrant vector DB stack.

  • Edge 414: We dive into HUSKY, a new agent optimized for multi-step reasoning.

You can subscribe to The Sequence below:

TheSequence is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

📝 Editorial: The Most Important Algorithm for Transformers

Few algorithms have had as much impact on the recent generation of transformer architectures as FlashAttention. Originally developed by Tri Dao and collaborators, FlashAttention and its successor FlashAttention-2 improved the performance of attention on GPUs by minimizing reads and writes between GPU memory and on-chip SRAM. The algorithm was adopted across the new generation of transformers almost immediately after the original publication. There were few complaints about FlashAttention, but one of them was that it could not take full advantage of new hardware architectures. For instance, FlashAttention-2 reaches only about 35% of the maximum FLOPS utilization on H100 GPUs.
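The core trick is easier to see in code. The sketch below is purely illustrative: the real FlashAttention is a fused CUDA kernel that keeps these tiles in on-chip SRAM, while this pure-PyTorch version only mimics the blockwise "online softmax" math that avoids materializing the full attention matrix. The function name, shapes, and block size are arbitrary choices for the demo, not part of the actual implementation.

```python
# Illustrative sketch of IO-aware, tiled attention with online softmax.
# Not the real FlashAttention kernel: it only shows the blockwise math.
import torch

def tiled_attention(q, k, v, block=128):
    # q, k, v: (seq_len, head_dim) for a single attention head
    scale = q.shape[-1] ** -0.5
    n = q.shape[0]
    out = torch.zeros_like(q)
    row_max = torch.full((n, 1), float("-inf"))
    row_sum = torch.zeros(n, 1)
    for start in range(0, n, block):
        kb = k[start:start + block]            # stream one K/V tile at a time
        vb = v[start:start + block]
        scores = (q @ kb.T) * scale            # (n, block) tile of logits
        new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)
        correction = torch.exp(row_max - new_max)   # rescale earlier partial results
        p = torch.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ vb
        row_max = new_max
    return out / row_sum

q, k, v = (torch.randn(1024, 64) for _ in range(3))
ref = torch.softmax((q @ k.T) * 64 ** -0.5, dim=-1) @ v
print(torch.allclose(tiled_attention(q, k, v), ref, atol=1e-4))  # matches naive attention
```

The point of the tiling is that the full seq_len × seq_len score matrix never needs to exist in memory, which is exactly the read-write saving the fused kernel exploits.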

But now we have a new version.

Last week, a group of AI researchers from Meta, Princeton University, NVIDIA, and other AI labs published the paper and open-source code for FlashAttention-3. The new version uses several techniques to speed up attention on H100 GPUs, exploiting the asynchrony of the Hopper Tensor Cores and the Tensor Memory Accelerator (TMA). The result is simple: FlashAttention-3 is blazing fast. It reaches about 75% of the theoretical max FLOPS utilization on H100, which translates into practical 1.5-2x speedups. The new kernel can also use lower-precision (FP8) numbers, which further increases throughput and reduces the memory footprint.
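For practitioners, the simplest way to tap into fused attention kernels from PyTorch is scaled_dot_product_attention. Whether it dispatches to FlashAttention-3 specifically depends on your PyTorch build and GPU, so treat the snippet below as a hedged sketch (it assumes PyTorch 2.3+ and a CUDA GPU), not a guaranteed path to the new kernels.

```python
# Hedged sketch: request PyTorch's fused flash attention backend. Whether this
# actually runs FlashAttention-3 depends on the PyTorch build and GPU (the FA-3
# kernels target Hopper/H100); older builds fall back to earlier flash kernels.
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

# (batch, heads, seq_len, head_dim), half precision as the flash kernels expect
q, k, v = (torch.randn(2, 16, 4096, 128, device="cuda", dtype=torch.bfloat16)
           for _ in range(3))

with sdpa_kernel(SDPBackend.FLASH_ATTENTION):  # restrict dispatch to the flash backend
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
```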

FlashAttention-3 is an exciting development in generative AI algorithms. It will almost certainly translate into longer context windows in LLMs and better inference performance on modern GPU architectures. Impressive progress!

🔎 ML Research

FlashAttention-3

A group of AI researchers from Meta, Princeton University, Together AI, NVIDIA, and others published a paper unveiling the new version of the famous FlashAttention algorithm. FlashAttention-3 takes advantage of the latest GPU advancements, achieving 2x the performance of its predecessor and excelling in long-context LLM tasks —> Read more.

Sub-Billion Parameter Models for Mobile

Meta AI published a paper introducing MobileLLM, a sub-billion-parameter model optimized for on-device scenarios. MobileLLM uses a deep-and-thin structure of embedding and attention layers that optimizes its quality relative to its size —> Read more.

Generative Teaching for Agents

Microsoft Research published a paper unveiling AgentInstruct, an agentic framework for creating synthetic data. Specifically, AgentInstruct focuses on datasets used for instruction tuning of base models —> Read more.

Evaluating Multimodal Foundation Models

Researchers from Carnegie Mellon University published a paper introducing the Holistic Evaluation of Multimodal Models (HEMM) framework. HEMM sets out primitives to systematically evaluate multimodal models across dimensions such as basic skills, information flow, and real-world use cases —> Read more.

A Unified AI Database

Microsoft Research published a paper proposing VBase, the foundation for a unified database for vector, relational, and scalar data types. The core of VBase is a property called relaxed monotonicity, which enables unified query execution across these different data types —> Read more.

Contamination in Code Generation Benchmarks

Researchers from Cohere published a paper providing evidence of the extent to which code generation benchmarks have leaked into the training data of major LLMs. The paper also proposes Less Basic Python Problems, a new benchmark that is more resilient to contamination —> Read more.

Autoregressive Models for Text-Image Generation

The team behind the Generative AI Research Lab (GAIR) published a paper unveiling ANOLE, an autoregressive multimodal model for image and text generation. ANOLE builds on Meta AI’s Chameleon and uses a data- and parameter-efficient fine-tuning strategy —> Read more.

🤖 Cool AI Tech Releases

Claude High Quality Prompts

Anthropic released new features to evaluate and generate high-quality prompts for Claude —> Read more.

MInference

Microsoft released some demos of its MInference method for optimizing LLM inference performance —> Read more.

AutoGen Models

Microsoft AutoGen added support for non-OpenAI models —> Read more.

🛠 Real World AI

Ad Inference at Meta

Meta shared some details about the AI inference architecture powering its ad serving system —> Read more.

📡 AI Radar
