Generative Audio Models Just Had a Great Week

Three major generative audio models were released in the last seven days.

Next Week in The Sequence:

  • Edge 385: Our series about autonomous agents continues with a discussion of the two biggest architectural approaches to building agents. We review the research behind Adept’s Fuyu-8B model, which powers its agent platform, and dive into Microsoft’s AutoGen framework for collaborative agents.

  • Edge 386: We dive into Yi, the massive Chinese LLM that is matching top Western alternatives.

You can subscribe to The Sequence below:

TheSequence is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

📝 Editorial: Generative Audio Models Just Had a Great Week

Audio is rapidly becoming one of the most important frontiers in generative AI, a field that is advancing swiftly. Several factors contribute to this rapid evolution. Technically, generative audio poses a fundamentally simpler problem than video or 3D, which leads to faster iterations in research and implementation. Additionally, from a model standpoint, many of the techniques that pioneered text-to-image generation, such as diffusion, are quite applicable to audio. From a market perspective, there exists a rich set of audio datasets that can be used to train new models, and the impact on industries such as media, robotics, or home automation can be quite immediate.

The pace of innovation in generative audio is accelerating at a remarkable rate. Just a few days ago, OpenAI shared some details about Voice Engine, a new model for synthetic voice generation. Last week, we saw several major releases related to generative audio:

  1. Stability AI open-sourced the new version of Stable Audio 2.0, which can generate music tracks up to three minutes long.

  2. Assembly AI released a multilingual speech-to-text model called Universal-1.

  3. Resemble AI announced a high-quality speech enhancement model called Resemble Enhance.

To these innovations, add more established players such as ElevenLabs, which has been pushing the boundaries of generative audio for years. Generative audio might be inching toward its ChatGPT moment faster than we think.

🔎 ML Research

Jamba

Following its open-source release (which we covered last week), AI21 published a paper detailing Jamba. The model presents a unique transformer-Mamba MoE architecture that looks to leverage the best of both approaches —> Read more.

Many-Shot Jailbreaking

Anthropic published a paper detailing many-shot jailbreaking, a technique that can bypass traditional guardrails in LLMs, including Claude. The technique exploits the large context windows of modern LLMs, which allow an attacker to position malicious text at different points in the prompt —> Read more.

Gecko

Google DeepMind published a paper discussing Gecko, a hyper-efficient text embedding model that can achieve strong generalization with relatively little data. Gecko uses a very simple architecture that distills knowledge from LLMs into a retriever —> Read more.

MambaMixer

Researchers from Cornell University published a paper outlining MambaMixer, an architectural block that can be added to state space models (SSMs). MambaMixer uses a dual selection mechanism that mixes information across tokens and channels —> Read more.

Mixture-of-Depths

Google DeepMind published a paper detailing a technique for allocating compute to specific positions in a sequence instead of the entire sequence. This approach optimizes compute in transformer models by capping the number of tokens that can participate in the attention layers —> Read more.

ReFT

Researchers from Stanford University published a paper introducing representation fine-tuning (ReFT), a technique that looks to edit representations in LLMs via fine-tuning. ReFT banks on the idea that representations encode rich semantic information, which can lead to more effective fine-tuning —> Read more.

🤖 Cool AI Tech Releases

Stable Audio 2.0

Stability AI released Stable Audio 2.0, which can generate musical tracks up to three minutes long —> Read more.

Universal-1

AssemblyAI launched Universal-1, its multilingual speech-to-text model —> Read more.

Command R+

Cohere introduced Command R+, the new version of its LLM optimized for RAG and tool usage —> Read more.

OpenAI Custom Models

OpenAI announced enhancements to its training API as well as new mechanisms for building custom models —> Read more.

Resemble Enhance

Resemble AI released Resemble Enhance, a high-quality speech enhancement model —> Read more.

🛠 Real World ML

Text-To-SQL at Pinterest

The Pinterest team shared details about their use of text-to-SQL models for analytics workflows —> Read more.

ML Lifecycle Management at Salesforce

Salesforce discussed details about its ML Console for managing the lifecycle of internal ML workloads —> Read more.

📡 AI Radar
