Some Cool Details About Llama 3

Solid performance, a new tokenizer, heavily optimized training, and other details about Meta AI’s new model.

Next Week in The Sequence:

  • Edge 389: In our series about autonomous agents, we discuss the concept of large action models (LAMs). We review the LAM research pioneered by the team from Rabbit, and we dive into the MetaGPT framework for multi-agent systems.

  • Edge 390: We dive into Databricks’ impressive new model: DBRX.

📝 Editorial: Some Cool Details About Llama 3

I had an editorial prepared for this week’s newsletter, but then Meta AI released Llama 3! Such are the times we live in. Generative AI is evolving on a weekly basis, and Llama 3 is one of the most anticipated releases of the past few months.

Since the debut of the original version, Llama has become one of the foundational blocks of the open source generative AI space. I prefer to use the term “open models,” given that these releases are not completely open source, but that’s just my preference.

The release of Llama 3 builds on incredible momentum within the open model ecosystem and brings its own innovations. The 8B and 70B versions of Llama 3 are available, with a 400B version currently being trained.

The Llama 3 architecture is a decoder-only transformer and includes a new tokenizer with a 128K-token vocabulary. This is quite notable, given that, with few exceptions, most large language models simply reuse existing tokenizers. The larger vocabulary encodes text more efficiently, which Meta credits with substantial performance gains. Another area of improvement is grouped query attention (GQA), which Llama 2 used only in its larger models and which Llama 3 adopts in both the 8B and 70B versions. GQA improves inference efficiency by letting groups of query heads share a single set of key and value heads, which shrinks the key-value cache that must be held in memory during generation. The context window has also increased, to 8,192 tokens.
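To make the grouped query attention point concrete, here is a rough back-of-the-envelope sketch of how the key-value cache shrinks when several query heads share one key/value head. The layer count, head counts, head dimension, and 16-bit precision below are illustrative assumptions chosen to resemble an 8B-class model; they are not figures from this editorial, and the helper `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for a single sequence."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed, illustrative configuration (not taken from the editorial).
N_LAYERS = 32        # transformer blocks
N_QUERY_HEADS = 32   # query heads per layer
N_KV_HEADS = 8       # key/value heads per layer under grouped query attention
HEAD_DIM = 128       # dimension of each head
SEQ_LEN = 8192       # tokens held in the cache

GIB = 1024 ** 3

# Classic multi-head attention: every query head has its own key/value head.
mha = kv_cache_bytes(N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN)
# Grouped query attention: 4 query heads share each key/value head.
gqa = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, SEQ_LEN)

print(f"multi-head attention:    {mha / GIB:.1f} GiB per sequence")
print(f"grouped query attention: {gqa / GIB:.1f} GiB per sequence")
print(f"reduction:               {mha // gqa}x")
```

Under these assumptions the cache drops from 4 GiB to 1 GiB per full-length sequence, which is where the inference benefit would come from: more concurrent sequences fit in the same GPU memory.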

Training is one area in which Llama 3 drastically improves over its predecessors. The model was trained on over 15 trillion tokens, a very large corpus for an 8B-parameter model, which speaks to the level of optimization Meta achieved in this release. It’s interesting to note that only a little over 5% of the training corpus consisted of non-English tokens. The training infrastructure ran on 16,000 GPUs at a throughput of over 400 TFLOPS per GPU, which is nothing short of monumental.
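A quick calculation shows just how large that corpus is relative to model size. The sketch below divides the 15 trillion training tokens by the parameter counts of the two released models. The figure of roughly 20 tokens per parameter is a commonly cited compute-optimal rule of thumb, included here only as an assumed point of comparison; it does not come from Meta's release.

```python
TRAINING_TOKENS = 15e12  # 15 trillion tokens, as stated above

# Assumption: a commonly cited compute-optimal rule of thumb, used only
# as a yardstick. It is not a figure from Meta's release.
RULE_OF_THUMB_TOKENS_PER_PARAM = 20


def tokens_per_param(tokens, params):
    """Training tokens seen per model parameter."""
    return tokens / params


for name, params in [("8B", 8e9), ("70B", 70e9)]:
    ratio = tokens_per_param(TRAINING_TOKENS, params)
    multiple = ratio / RULE_OF_THUMB_TOKENS_PER_PARAM
    print(f"Llama 3 {name}: {ratio:,.0f} tokens per parameter "
          f"(about {multiple:.0f}x the rule of thumb)")
```

For the 8B model this works out to 1,875 tokens per parameter, far beyond the assumed yardstick, which suggests the smaller model was deliberately trained well past the compute-optimal point to get more quality out of a model that is cheap to serve.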

Llama 3 is a very welcome addition to the open model generative AI stack. The initial benchmark results are quite impressive, and the 400B version could rival GPT-4. Distribution is one area where Meta excelled in this release, making Llama 3 available on all major machine learning platforms. It’s been just a few hours, and we are already seeing open source innovations using Llama 3. The momentum in the generative AI open models space definitely continues, even if it forced me to rewrite the entire editorial. 😊

🔎 ML Research

VASA-1

Microsoft Research published a paper detailing VASA-1, a framework for generating talking faces from static images and audio clips. The model can generate facial gestures such as head and lip movements in a very expressive way —> Read more.

Zamba

Zyphra published a paper introducing Zamba, a 7B state space model (SSM). Zamba introduces a new architecture that combines Mamba blocks with attention layers, which leads to high performance in training and inference with lower computational resources —> Read more.

MEGALODON

AI researchers from Meta and Carnegie Mellon University published a paper introducing MEGALODON, a new architecture that can scale to virtually unlimited context windows. As its name indicates, MEGALODON is based on the MEGA architecture with an improved gated attention mechanism —> Read more.

SAMMO

Microsoft Research published a paper detailing Structure-Aware Multi-objective Metaprompt Optimization (SAMMO), a framework for prompt optimization. The framework is able to optimize prompts for scenarios such as RAG or instruction tuning —> Read more.

Infini-Attention

Google Research published a paper introducing Infini-Attention, a method to scale the context window in transformer architectures to virtually unlimited levels. The method adds a compressive memory to the attention layer, which allows it to combine long-term and masked local attention in a single transformer block —> Read more.

AI Agents Ethics

Google DeepMind published a paper discussing ethical considerations in AI assistants. The paper covers aspects such as alignment, safety, and misuse —> Read more.

🤖 Cool AI Tech Releases

Llama 3

Meta AI introduced the highly anticipated Llama 3 model —> Read more.

Stable Diffusion 3

Stability AI launched the APIs for Stable Diffusion 3 as part of its developer platform —> Read more.

Reka Core

Reka, an AI startup built by former DeepMind engineers, announced its Reka Core multimodal models —> Read more.

OpenEQA

Meta AI released OpenEQA, a benchmark for visual language models in physical environments —> Read more.

Gemini Cookbook

Google open sourced the Gemini Cookbook, a series of examples for interacting with the Gemini API —> Read more.

🛠 Real World ML

AI Privacy at Slack

Slack discusses the architecture enabling privacy capabilities in its AI platform —> Read more.

📡 AI Radar