Journal Article, DOI: 10.48550/arxiv.2402.15391
Genie: Generative Interactive Environments
Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Bechtle, Feryal Behbahani, Stephanie Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Żołna, Jeff Clune, Nando de Freitas, Satinder Singh, Tim Rocktäschel
39
TL;DR: Genie is the first generative interactive environment trained in an unsupervised manner from unlabelled Internet videos. It lets users act in the generated environments on a frame-by-frame basis despite being trained without any ground-truth action labels or other domain-specific requirements typically found in the world model literature.
Abstract: We introduce Genie, the first generative interactive environment trained in an unsupervised manner from unlabelled Internet videos. The model can be prompted to generate an endless variety of action-controllable virtual worlds described through text, synthetic images, photographs, and even sketches. At 11B parameters, Genie can be considered a foundation world model. It is comprised of a spatiotemporal video tokenizer, an autoregressive dynamics model, and a simple and scalable latent action model. Genie enables users to act in the generated environments on a frame-by-frame basis despite training without any ground-truth action labels or other domain-specific requirements typically found in the world model literature. Further, the resulting learned latent action space facilitates training agents to imitate behaviors from unseen videos, opening the path for training generalist agents of the future.
Citations
Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability
Shenyuan Gao, Jiazhi Yang, Lin Chen, Kashyap Chitta, Yongqiang Qiu, Andreas Geiger, Jun Zhang, Hongyang Li
- 27 May 2024
TL;DR: Vista is a generalizable driving world model with high fidelity and versatile controllability. It addresses limitations in generalization, prediction fidelity, and action controllability through novel losses, latent replacement, and efficient learning strategies.
Pyramidal Flow Matching for Efficient Video Generative Modeling
Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, Zhouchen Lin
- 08 Oct 2024
TL;DR: This work introduces Pyramidal Flow Matching, a unified algorithm for efficient video generative modeling, enabling high-quality video generation at 768p resolution and 24 FPS within 20.7k A100 GPU training hours with a single Diffusion Transformer.
Diffusion Policy Policy Optimization
Allen Z. Ren, Justin Lidard, Lars Lien Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, Max Simchowitz
- 31 Aug 2024
TL;DR: DPPO, a novel algorithm, optimizes diffusion-based policies in continuous control and robot learning tasks using policy gradient methods, achieving stronger performance and efficiency than other RL methods, leveraging synergies between RL fine-tuning and diffusion parameterization.
iVideoGPT: Interactive VideoGPTs are Scalable World Models
Jing Wu, Shu Yin, Ningya Feng, Xu He, Dong Liu, Jianye Hao, Mingsheng Long
- 24 May 2024
TL;DR: iVideoGPT is a scalable world model framework that integrates visual observations, actions, and rewards into a sequence of tokens, enabling interactive exploration and planning in imagined environments.
References
•Posted Content
Deep Residual Learning for Image Recognition
TL;DR: This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
117.9K
•Proceedings Article
Attention is All you Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
- 12 Jun 2017
TL;DR: This paper proposes a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
•Posted Content
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby
TL;DR: Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
•Proceedings Article
Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Thomas Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Samuel McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
- 28 May 2020
TL;DR: GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.
Hierarchical Text-Conditional Image Generation with CLIP Latents
TL;DR: This work proposes a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. It shows that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity.
4.3K