Edge 384: Inside Genie: Google DeepMind’s Astonishing Model that can Build 2D Games from Text and Images

The model represents a new category in generative AI.

Created Using Ideogram

The pace of research in generative AI is nothing short of remarkable. Even so, from time to time, a paper comes along that challenges our imagination about how far the generative AI space can go. A few weeks ago, Google DeepMind published work of that kind with the release of Genie, a model that can generate interactive game environments from text and images.

If we think video generation with models like Sora is impressive, imagine inferring the interactive actions in those videos. Imagine a scenario where the vast array of videos available on the Internet could serve as a training ground for models to not only create new images and videos but also to forge entire interactive environments. This is the vision that Google DeepMind has turned into reality with Genie, a groundbreaking approach to generative AI. Genie can craft interactive environments from a single text or image prompt, thanks to its training on over 200,000 hours of publicly available gaming videos from the Internet. What makes Genie stand out is that it can be controlled on a frame-by-frame basis through a learned latent action space, even though it was trained without any action or text annotations.

Genie’s Architecture