Key Points

1. Genie is introduced as the first generative interactive environment trained in an unsupervised manner from unlabelled Internet videos, capable of converting a variety of prompts, including text, synthetic images, photographs, and sketches, into interactive, playable environments.

2. The model, consisting of a spatiotemporal video tokenizer, an autoregressive dynamics model, and a latent action model, is trained on a large dataset of over 200,000 hours of publicly available Internet gaming videos (a sketch of how these pieces fit together appears after this list).

3. Genie enables users to act in the generated environments on a frame-by-frame basis, despite being trained without any ground-truth action labels or other domain-specific requirements.

4. The resulting learned latent action space facilitates training agents to imitate behaviors from unseen videos, opening the path for training generalist agents of the future.

5. The architecture of the model scales gracefully with additional computational resources, leading to a final 11B parameter model.

6. The final model is trained on a curated subset of 30,000 hours of Internet gameplay videos, filtered from the larger corpus and drawn from hundreds of 2D platformer games, producing a foundation world model for this setting.

7. Genie can generate diverse trajectories in unseen RL environments and learns distinct, consistent latent actions from video data without text or action labels.

8. Through the learned latent action space, Genie offers improved controllability and high video generation quality, and it can emulate parallax, deformable objects, and diverse trajectories across different environments.

9. The potential societal impact of Genie is highlighted: it could enable individuals to generate their own game-like experiences and empower a new wave of playable-world development.
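
To make the three-part pipeline in point 2 concrete, here is a minimal sketch of the inference loop with toy stand-ins for the real networks. The class names, tensor shapes, and tokenizer vocabulary size below are assumptions for illustration only; the three-component structure, the frame-by-frame latent action input, and the small 8-action codebook come from the paper.

```python
# Minimal, non-authoritative sketch of Genie-style interactive generation.
# Toy stand-ins replace the real ST-transformer networks.
import numpy as np

NUM_LATENT_ACTIONS = 8   # the paper reports a small codebook of 8 latent actions
TOKENS_PER_FRAME = 16    # hypothetical; the real tokenizer emits a token grid per frame
VOCAB_SIZE = 1024        # hypothetical codebook size for the video tokenizer

class VideoTokenizer:
    """Stand-in for the spatiotemporal video tokenizer (a VQ-VAE in the paper)."""
    def encode(self, frame: np.ndarray) -> np.ndarray:
        # Real model: ST-transformer encoder followed by vector quantization.
        return np.random.randint(0, VOCAB_SIZE, size=TOKENS_PER_FRAME)

    def decode(self, tokens: np.ndarray) -> np.ndarray:
        # Real model: decoder reconstructs an image frame from discrete tokens.
        return np.zeros((64, 64, 3), dtype=np.uint8)

class DynamicsModel:
    """Stand-in for the autoregressive dynamics model."""
    def predict_next_tokens(self, token_history: list, action: int) -> np.ndarray:
        # Real model: predicts the next frame's tokens conditioned on all
        # previous frame tokens and the user-chosen latent action.
        return np.random.randint(0, VOCAB_SIZE, size=TOKENS_PER_FRAME)

def play(prompt_frame: np.ndarray, user_actions: list) -> list:
    """Generate a playable rollout: one new frame per user-chosen latent action."""
    tokenizer, dynamics = VideoTokenizer(), DynamicsModel()
    token_history = [tokenizer.encode(prompt_frame)]
    frames = []
    for action in user_actions:
        assert 0 <= action < NUM_LATENT_ACTIONS
        next_tokens = dynamics.predict_next_tokens(token_history, action)
        token_history.append(next_tokens)
        frames.append(tokenizer.decode(next_tokens))
    return frames

frames = play(np.zeros((64, 64, 3), dtype=np.uint8), user_actions=[0, 3, 3, 7])
print(f"generated {len(frames)} frames")
```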

Summary

The research paper introduces Genie, the first generative interactive environment trained from unlabelled Internet videos. At 11B parameters, Genie can generate action-controllable virtual worlds described through text, synthetic images, photographs, and sketches. The model comprises a spatiotemporal video tokenizer, an autoregressive dynamics model, and a simple, scalable latent action model. Genie lets users act in the generated environments frame by frame, and its learned latent action space can be used to train agents to imitate behaviors from unseen videos. The model is trained on a large dataset of Internet gaming videos and can generate interactive environments from a single prompt. The paper also discusses the model's scaling behavior and its generalizability to other domains.
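
The imitation claim rests on the latent action model serving as a learned inverse-dynamics labeller: given two consecutive frames of an action-free video, it infers which discrete latent action best explains the transition. Below is a rough sketch of that idea; the `infer_action` interface is hypothetical, not Genie's actual API.

```python
# Sketch: label an unseen, action-free video with inferred latent actions,
# producing (observation, action) pairs usable for behavior cloning.
import numpy as np

class LatentActionModel:
    """Stand-in for Genie's latent action model used as an inverse-dynamics
    labeller between consecutive frames."""
    def infer_action(self, frame_t: np.ndarray, frame_t1: np.ndarray) -> int:
        return int(np.random.randint(0, 8))  # paper uses 8 latent actions

def label_video(frames: list, lam: LatentActionModel) -> list:
    """Turn an action-free video into (observation, latent action) pairs."""
    return [(frames[t], lam.infer_action(frames[t], frames[t + 1]))
            for t in range(len(frames) - 1)]

video = [np.zeros((64, 64, 3), dtype=np.uint8) for _ in range(10)]
dataset = label_video(video, LatentActionModel())
print(f"{len(dataset)} (obs, latent action) pairs for behavior cloning")
```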

The model is evaluated on its ability to generate diverse trajectories and to simulate physical properties such as parallax and deformable objects. The paper also details the architectural design choices and training procedure. Overall, Genie presents a novel approach to generative AI, opening new pathways for training generalist agents and enabling users to create and interact with virtual worlds. The paper acknowledges Genie's potential societal impact and provides considerations regarding training data, reproducibility, and project contributors.
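
One design choice worth making concrete is the discretization of latent actions: a continuous embedding from the latent action model's encoder is snapped to the nearest entry of a small vector-quantization codebook, which is what turns the action space into a small, controller-like vocabulary. The sketch below shows a generic VQ nearest-neighbour lookup; the 8-entry codebook follows the paper, while the embedding width is an arbitrary illustrative choice.

```python
# Generic vector-quantization step of the kind used to discretize latent
# actions: a continuous embedding is snapped to its nearest codebook entry.
import numpy as np

rng = np.random.default_rng(0)
CODEBOOK_SIZE = 8   # the paper reports that 8 latent actions works well
EMBED_DIM = 32      # hypothetical embedding width, for illustration

codebook = rng.normal(size=(CODEBOOK_SIZE, EMBED_DIM))

def quantize(z: np.ndarray):
    """Return the index and vector of the codebook entry nearest to z."""
    dists = np.linalg.norm(codebook - z, axis=1)
    idx = int(np.argmin(dists))
    return idx, codebook[idx]

z_continuous = rng.normal(size=EMBED_DIM)  # e.g. output of the LAM encoder
action_id, z_quantized = quantize(z_continuous)
print(f"continuous embedding snapped to latent action {action_id}")
```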

Reference: https://arxiv.org/abs/2402.153...