One of the most impressive research efforts in generative video of the last year.
Video understanding might become the next frontier for generative AI. Building AI models and agents that fully understand complex environments has long been one of the goals of AI. The recent generative AI revolution has expanded the horizons of AI models, enabling them to understand environments through language, video, and images. Video understanding seems to be the key to unlocking this capability, as videos capture object interaction, physics, and other key characteristics of real-world settings. A group of AI researchers from UC Berkeley, including AI legend Pieter Abbeel, published a paper proposing a model that can learn complex representations from images and videos in sequences of up to one million tokens. They named the model the Large World Model (LWM).
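To put the million-token figure in perspective, a rough back-of-the-envelope calculation helps. The sketch below assumes, purely for illustration, a discrete video tokenizer that turns each frame into a fixed number of tokens; the tokens-per-frame, frame rate, and text-token values are hypothetical and not taken from the paper.

```python
# Back-of-the-envelope: how quickly video turns into very long token sequences.
# All numbers below are illustrative assumptions, not values from the LWM paper.

TOKENS_PER_FRAME = 256   # e.g. a discrete tokenizer emitting a 16x16 grid per frame
FPS = 4                  # frames sampled per second of video
TEXT_TOKENS = 1_000      # accompanying text (captions, dialogue, instructions)

def sequence_length(video_seconds: float) -> int:
    """Total tokens for a video of the given length plus its accompanying text."""
    video_tokens = int(video_seconds * FPS) * TOKENS_PER_FRAME
    return video_tokens + TEXT_TOKENS

for minutes in (1, 5, 15):
    print(f"{minutes:>2} min of video -> {sequence_length(minutes * 60):,} tokens")
# Under these assumptions, roughly 15-17 minutes of video already approaches
# one million tokens, before any longer-form content is considered.
```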
The Problem
Today’s language models have difficulty grasping aspects of the world that are hard to capture through text alone, especially when it comes to managing intricate, extended tasks. Videos provide a rich source of temporal information that static images and text cannot offer, highlighting the potential benefits of integrating video with language in model training. This integration aims to create models that comprehend both textual knowledge and the physical world, broadening AI’s potential to assist humans. Nevertheless, the ambition to learn from millions of tokens spanning video and language sequences is hampered by significant hurdles such as memory limitations, computational challenges, and the scarcity of comprehensive datasets.
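A quick calculation illustrates the memory hurdle: the attention matrix of a standard transformer grows quadratically with sequence length, which is what makes naive attention infeasible at one million tokens. The figures below are generic transformer arithmetic, not numbers reported in the paper.

```python
# Why naive attention breaks down at million-token context lengths.
# Generic transformer arithmetic; not specific to the LWM architecture.

SEQ_LEN = 1_000_000      # target context length in tokens
BYTES_PER_VALUE = 2      # fp16 / bf16

# A full attention score matrix holds seq_len x seq_len values
# for every attention head in every layer.
scores_per_head = SEQ_LEN ** 2
bytes_per_head = scores_per_head * BYTES_PER_VALUE

print(f"Attention scores per head: {scores_per_head:.2e} values")
print(f"Memory per head (fp16):    {bytes_per_head / 1e12:.1f} TB")
# ~2 TB for a single head of a single layer -- far beyond any accelerator's memory,
# which is why long-context training relies on techniques that avoid
# materializing the full attention matrix.
```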