The Most Amazing Week in Gen AI Releases

OpenAI, Google, Microsoft, Cohere and others shipped new models.


Next Week in The Sequence

  • Edge 457: Provides an overview of attention-based distillation including a major paper in that area. We also dive into the famous OmniParser framework for vision language models.

  • The Sequence Chat: Debates the complex subject of gen AI as a mechanism for discovering new science and some of its major limitations.

  • Edge 458: Discusses Allen AI’s new Tulu 3 framework for post-training of foundation models.


TheSequence is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

📝 Editorial: Models, Models, Models: The Most Amazing Week in Gen AI Releases

As the holidays approach, it seems that every major AI lab has decided to release its latest models. Without a doubt, last week was one of the most impressive in the history of generative AI in terms of model releases, with Microsoft, OpenAI, Google, Cohere and others shipping new models.

Take a look:

  • Sora: One of OpenAI's most anticipated releases, Sora is a video generation model that brings text-to-video capabilities to the forefront. Sora lets users create realistic videos from text prompts, either by extending, remixing, and blending existing assets or by generating entirely new content. It features a new interface with a storyboard tool for precise input specification, alongside Featured and Recent feeds showcasing community creations. OpenAI acknowledges the limitations of this early version, particularly in generating realistic physics and handling complex actions over extended durations. The company emphasizes its commitment to responsible deployment, highlighting efforts to ensure transparency, mitigate deepfakes, and prevent misuse.

  • Gemini 2.0: The release of Gemini 2.0 ushers in what Google calls “the agentic era.” The new model builds on the multimodal capabilities of its predecessors, Gemini 1.0 and 1.5, and introduces native image and audio output, along with native tool use. These advancements enable Gemini 2.0 to perform more complex tasks, better understand the world around users, and even take actions on their behalf. Google emphasizes its commitment to responsible AI development, highlighting safety and security as key priorities in building these agentic experiences.

  • Command R7B: Command R7B, developed by Cohere, is the smallest model in their R series, focusing on speed, efficiency, and quality for building AI applications. Its key strength lies in its ability to run efficiently on commodity GPUs and edge devices, making it accessible for a wider range of developers and applications. This accessibility, combined with its top-tier performance, positions Command R7B as a valuable tool for creating powerful AI solutions without requiring high-end hardware.

  • Phi-4: Microsoft Phi was the model that started the SLM movement. Phi-4 distinguishes itself as a small language model specializing in complex reasoning tasks. Despite its compact size, Phi-4 achieves impressive performance in mathematical and logical reasoning, demonstrating its ability to handle intricate problems typically requiring larger models. This efficiency is attributed to advancements in training techniques and the utilization of high-quality synthetic datasets. Microsoft emphasizes the accessibility and responsible use of Phi-4, making it available on Azure AI and incorporating safety features like content filtering.

Even by generative AI standards, last week qualifies as impressive. The level of innovation in this market is something the tech industry hasn’t seen since the personal computer revolution. Quite remarkable.

🔎 ML Research

Phi-4

In “Phi-4 Technical Report”, researchers from Microsoft Research present phi-4, a 14-billion-parameter language model. Phi-4 is trained using a data-centric approach that prioritizes data quality, combining synthetic data generated through multi-step prompting workflows with curated, high-quality organic data —> Read more.

Best-of-N Jailbreaking

In “Best-of-N Jailbreaking”, researchers from Anthropic, Speechmatics, MATS, UCL, Stanford University, University of Oxford, Tangentic, and others present a black-box method called Best-of-N (BoN) jailbreaking that works across text, vision, and audio inputs. BoN repeatedly samples augmented versions of a harmful request, up to 10,000 in the paper's experiments, by applying perturbations such as character scrambling, random capitalization, and character noising, until one of them elicits a harmful response —> Read more.
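For readers who want a feel for the three text perturbations named above, here is a minimal sketch that is also usable for ordinary robustness testing of a classifier or filter. The function names (`scramble_word`, `augment`) and the default probabilities are assumptions made for illustration, not the paper's code or settings.

```python
import random


def scramble_word(word, rng):
    """Shuffle the interior characters of a word, keeping first and last."""
    if len(word) <= 3:
        return word
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]


def augment(text, rng, p_scramble=0.6, p_caps=0.6, p_noise=0.06):
    """Apply the three perturbations; the probabilities are assumed defaults."""
    # 1. Character scrambling: shuffle the inside of some words.
    words = [
        scramble_word(w, rng) if rng.random() < p_scramble else w
        for w in text.split(" ")
    ]
    chars = []
    for ch in " ".join(words):
        # 2. Random capitalization: flip the case of some letters.
        if ch.isalpha() and rng.random() < p_caps:
            ch = ch.swapcase()
        # 3. Character noising: nudge some printable ASCII characters by one.
        if 33 <= ord(ch) <= 125 and rng.random() < p_noise:
            ch = chr(ord(ch) + rng.choice((-1, 1)))
        chars.append(ch)
    return "".join(chars)


# A benign prompt; a seeded generator makes each variant reproducible.
prompt = "benign example prompt for robustness testing"
variants = [augment(prompt, random.Random(seed)) for seed in range(3)]
```

Each variant keeps the length of the original string, so it stays roughly readable while looking different to a model at the character level.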

Video Creation by Demonstration

In “Video Creation by Demonstration”, researchers from Google DeepMind introduce a novel video generation task and a corresponding method called 𝛿-Diffusion. 𝛿-Diffusion allows users to create videos that continue from a given context image while incorporating action concepts from a demonstration video, enabling creative control and realistic video synthesis —> Read more.

JuStRank

In “JuStRank: Benchmarking LLM Judges for System Ranking”, researchers from IBM Research introduce JuStRank, the first large-scale benchmark for evaluating LLM judges for ranking target systems. The study examines various LLM judges and aggregation methods, comparing their system rankings to a human-based ranking, providing insights into judge behavior and bias —> Read more.
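To make the idea of comparing a judge's system ranking against a human ranking concrete, here is a small sketch. It assumes mean aggregation of per-response judge scores and Kendall's tau as the agreement measure, both common choices for this kind of comparison; the system names and scores are made up, and none of this is the benchmark's actual code.

```python
from itertools import combinations
from statistics import mean


def rank_systems(scores_by_system):
    """Order system names from best to worst by mean judge score."""
    return sorted(
        scores_by_system,
        key=lambda name: mean(scores_by_system[name]),
        reverse=True,
    )


def kendall_tau(ranking_a, ranking_b):
    """Kendall's tau between two rankings of the same items (no ties).

    Requires at least two items. Returns 1.0 for identical orderings
    and -1.0 for fully reversed ones.
    """
    pos_a = {name: i for i, name in enumerate(ranking_a)}
    pos_b = {name: i for i, name in enumerate(ranking_b)}
    concordant = discordant = 0
    for x, y in combinations(ranking_a, 2):
        # Positive product: both rankings order the pair the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical judge scores (one per response) for three made-up systems.
judge_scores = {
    "system_a": [0.9, 0.8, 0.7],
    "system_b": [0.6, 0.7, 0.5],
    "system_c": [0.4, 0.9, 0.8],
}
human_ranking = ["system_a", "system_b", "system_c"]

judge_ranking = rank_systems(judge_scores)
agreement = kendall_tau(judge_ranking, human_ranking)
```

In this toy data the judge swaps the bottom two systems relative to the human ranking, so two of the three pairs agree and the tau comes out at one third.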

ScribeAgent

In “ScribeAgent: Fine-Tuning Open-Source LLMs for Enhanced Web Navigation”, researchers from Carnegie Mellon University present ScribeAgent, an approach that leverages fine-tuning of open-source LLMs on a large dataset of real-world web workflows. By fine-tuning Qwen models on a massive dataset of user-annotated web workflows, ScribeAgent surpasses GPT-4-based agents on various web navigation benchmarks —> Read more.

Meta’s New Research

Meta AI published a detailed roundup of its recent research, covering agentic workflows and new architectures. The research includes Meta Motivo for controlling embodied agent behaviors and Meta Video Seal for video watermarking —> Read more.

🤖 AI Tech Releases

Gemini 2.0

Google unveiled Gemini 2.0, its new model for agentic workflows —> Read more.

Sora

OpenAI released the first version of Sora, its highly anticipated text-to-video model —> Read more.

Command R7B

Cohere released Command R7B, a small LLM focused on enterprise AI apps —> Read more.

Fast-LLM

ServiceNow open sourced the framework Fast-LLM to streamline pretraining of foundation models —> Read more.

Pika 2.0

Pika announced its 2.0 model with quite a few new features —> Read more.

🛠 Real World AI

Inside Agentforce

Salesforce discusses the reasoning engine behind its Agentforce platform —> Read more.

📡 AI Radar
