Apple Goes Small and Super Multimodal

Plus a lot of new models being released and quite an active week for AI VCs.

Created Using Ideogram

Next Week in The Sequence:

  • Edge 411: We dive into episodic memory and autonomous agents including the Larimar research by Princeton University. We also explore the Chroma database stack.

  • Edge 412: We review three recent papers from Microsoft about privacy in generative AI.

You can subscribe to The Sequence below:

TheSequence is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

📝 Editorial: Apple Goes Small and Super Multimodal

Apple has been late to the generative AI game, but lately it has been pushing its research agenda quite hard. Apple has an ideal playground for innovating in one of the hottest areas of the next wave of generative AI: on-device multimodal models. Powering mobile AI through API integrations with massively large foundation models seems highly impractical and insecure, and Apple is in a unique position to power alternatives to that paradigm. However, most of Apple’s efforts in small on-device models have so far been somewhat underwhelming.

That definitely changed last week. Building on recent research, Apple published a paper and unveiled a demonstration of 4M-21, an any-to-any vision model trained across many tasks and modalities. 4M-21 builds on the original 4M research; the new model expands from 7 to 21 modalities, including highly specific ones such as edges, geometric and semantic data, and feature maps. Perhaps the biggest contribution of 4M-21 is the ability to train a single model across language and vision simultaneously with virtually no loss in performance. The secret? A series of modality-specific tokenizers that not only optimize multimodal learning but do so without the known challenges of very large models.
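
To make the idea concrete, here is a minimal, hypothetical sketch of what per-modality tokenizers feeding a shared discrete vocabulary might look like. The modality names, vocabulary sizes, and hash-based "tokenizer" are purely illustrative and are not Apple's implementation:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ModalityTokenizer:
    name: str
    vocab_offset: int  # where this modality's codes live in the shared vocabulary
    vocab_size: int

    def encode(self, raw) -> List[int]:
        # Placeholder: a real tokenizer would be, e.g., a VQ-VAE for dense maps
        # or a text tokenizer for captions; here we just hash values into codes.
        return [self.vocab_offset + (hash(x) % self.vocab_size) for x in raw]

# Illustrative subset of the 21 modalities (names and sizes are made up).
tokenizers: Dict[str, ModalityTokenizer] = {
    "rgb":     ModalityTokenizer("rgb",     vocab_offset=0,     vocab_size=8192),
    "depth":   ModalityTokenizer("depth",   vocab_offset=8192,  vocab_size=8192),
    "caption": ModalityTokenizer("caption", vocab_offset=16384, vocab_size=32000),
}

def build_training_sequence(sample: Dict[str, list]) -> List[int]:
    """Interleave tokens from every available modality into one sequence so a
    single model can learn any-to-any mappings over a shared token space."""
    sequence: List[int] = []
    for modality, raw in sample.items():
        sequence.extend(tokenizers[modality].encode(raw))
    return sequence

example = {"rgb": [0.1, 0.5, 0.9], "depth": [2.0, 2.1], "caption": ["a", "red", "car"]}
print(build_training_sequence(example))
```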

4M-21 is impressive, and its approach seems to address the key first principles of the type of models needed on iOS devices. In case it’s not obvious, Apple is no longer quiet in generative AI.

🔎 ML Research

A Small Any-to-Any Model

Apple Research published a paper introducing 4M-21, a small multimodal model optimized for dozens of different tasks and modalities. The core innovation of 4M-21 is training on a diverse set of modalities without sacrificing performance —> Read more.

Generating Function Calling Datasets

Salesforce Research published the paper and source code for APIGen, a pipeline for creating function calling datasets. APIGen emphasizes the verifiability and diversity of the generated datasets, which drastically improves the performance of LLMs in function calling tasks —> Read more.
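
As a rough illustration of the verifiability idea (a toy sketch, not Salesforce's pipeline; the `get_weather` function and the registry are invented), an execution-based filter for generated function-calling samples might look like this:

```python
# Keep only samples whose generated call parses, matches a known function
# signature, and actually runs without error.
import json
import inspect

def get_weather(city: str, unit: str = "celsius") -> dict:
    # Toy API used only for illustration.
    return {"city": city, "unit": unit, "temp": 21}

REGISTRY = {"get_weather": get_weather}

def verify_sample(generated_call_json: str) -> bool:
    try:
        call = json.loads(generated_call_json)            # stage 1: format check
        fn = REGISTRY[call["name"]]                       # stage 2: known function?
        inspect.signature(fn).bind(**call["arguments"])   # stage 3: signature check
        fn(**call["arguments"])                           # stage 4: execution check
        return True
    except Exception:
        return False

print(verify_sample('{"name": "get_weather", "arguments": {"city": "Paris"}}'))  # True
print(verify_sample('{"name": "get_weather", "arguments": {"zip": 75001}}'))     # False
```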

RouteLLM

LMSys published a paper and open-source code for RouteLLM, a new technique that selects a model at inference time based on performance. RouteLLM provides the training mechanisms to build routers based on human preferences or data augmentation —> Read more.
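
For intuition, here is a hypothetical sketch of the routing pattern. The length-based `difficulty_score` is a stand-in for the preference-trained router the paper describes, and the model names are placeholders:

```python
# Route "easy" queries to a cheap model and only send "hard" ones to the
# expensive model. The scorer here is a stub; RouteLLM trains it on human
# preference data or augmented data.

def difficulty_score(query: str) -> float:
    # Trivial length-based heuristic, purely for illustration.
    return min(len(query.split()) / 50.0, 1.0)

def route(query: str, threshold: float = 0.3) -> str:
    return "gpt-4-class-model" if difficulty_score(query) > threshold else "small-open-model"

print(route("What is 2 + 2?"))                                   # small-open-model
print(route("Draft a detailed migration plan for moving our "
            "multi-region Postgres cluster to a sharded setup "
            "with zero downtime and explain the tradeoffs."))     # gpt-4-class-model
```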

New Model Evals

Anthropic published an insightful post outlining a new initiative for creating third-party evaluations for foundation models. The post discusses the requirements and challenges of new evaluations and their relevance to improving foundation models —> Read more.

Text to 3D

Meta AI published a paper introducing 3D Gen, a pipeline for text-to-3D asset generation. The method combines two key components, AssetGen and TextureGen, to generate high-fidelity 3D assets in under a minute —> Read more.

Unlearning is not Enough

Google DeepMind published a paper introducing unUnlearning, a technique for reintroducing unlearned knowledge in context in a way that makes LLMs behave as if they still know the forgotten knowledge. The paper argues that unlearning is not enough for content regulation and that new techniques are required —> Read more.

Summarization Challenges

Salesforce Research published a paper outlining Summary of a Haystack (SumHay), a challenge for evaluating summarization in long-context LLMs and RAG systems. SumHay uses a carefully crafted set of documents with repeated insights and evaluates LLM summaries on how effectively they cite those insights —> Read more.
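
A toy sketch of the kind of scoring this implies (the field names and insight IDs below are invented and are not the paper's schema): measure how many planted insights a summary covers and how many it cites to the right documents.

```python
# Insights deliberately planted across the document "haystack".
planted_insights = {
    "i1": {"text": "remote work increases retention", "docs": {"doc3", "doc7"}},
    "i2": {"text": "onboarding takes 6 weeks on average", "docs": {"doc1", "doc9"}},
}

# Claims extracted from an LLM-generated summary, with the documents it cited.
summary_claims = [
    {"insight": "i1", "cited_docs": {"doc3"}},  # covered, and cites a correct source
]

def coverage_and_citation(planted, claims):
    covered = {c["insight"] for c in claims if c["insight"] in planted}
    cited_ok = {
        c["insight"] for c in claims
        if c["insight"] in planted and c["cited_docs"] & planted[c["insight"]]["docs"]
    }
    return len(covered) / len(planted), len(cited_ok) / len(planted)

print(coverage_and_citation(planted_insights, summary_claims))  # (0.5, 0.5)
```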

🤖 Cool AI Tech Releases

Gemma 2

Google released 9B and 27B versions of its Gemma 2 small LLMs —> Read more.

Moshi

French AI lab Kyutai released Moshi, an open source GPT-4o alternative —> Read more.

Multi Token Prediction

Meta AI released a series of baseline models using its multi-token prediction techniques —> Read more.

xLAM

Salesforce open sourced xLAM, a small LLM optimized for function calling —> Read more.

GraphRAG

Microsoft open sourced GraphRAG, a framework to create knowledge graphs over private datasets —> Read more.
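
A toy sketch of the core idea (not Microsoft's implementation, and it assumes the networkx package is available): extract entity-relation triples from private documents, store them in a graph, and retrieve an entity's neighborhood as grounded context instead of raw text chunks.

```python
import networkx as nx

# In GraphRAG the triples are extracted from documents by an LLM; hard-coded here.
triples = [
    ("Acme Corp", "acquired", "WidgetCo"),
    ("WidgetCo", "manufactures", "industrial sensors"),
    ("Acme Corp", "headquartered_in", "Austin"),
]

graph = nx.DiGraph()
for head, relation, tail in triples:
    graph.add_edge(head, tail, relation=relation)

def local_context(entity):
    """Return the facts around one entity, to be passed to an LLM as context."""
    facts = []
    for _, tail, data in graph.out_edges(entity, data=True):
        facts.append(f"{entity} {data['relation']} {tail}")
    return facts

print(local_context("Acme Corp"))
# ['Acme Corp acquired WidgetCo', 'Acme Corp headquartered_in Austin']
```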

Eval Datasets

Imbue open sourced a series of sanitized datasets to evaluate reasoning and coding tasks in LLMs —> Read more.

🛠 Real World AI

RAG in Production at Walmart

Walmart Global Tech discusses best practices for building production-ready RAG systems —> Read more.

Spark-EMR at Slack

Slack discusses their journey improving their big data infrastructure for ML workloads —> Read more.

📡 AI Radar
