Genie: Generative Interactive Environments

doi:10.48550/arxiv.2402.15391

Journal Article•10.48550/arxiv.2402.15391

Jake Bruce, +24 more

- 23 Feb 2024

- arXiv.org

- Vol. abs/2402.15391

39

TL;DR: This work introduces Genie, the first generative interactive environment trained in an unsupervised manner from unlabelled Internet videos. Users can act in the generated environments on a frame-by-frame basis, even though training uses no ground-truth action labels or other domain-specific requirements typically found in the world model literature.

Citations

Preprint•10.48550/arxiv.2405.17398

Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability

Shenyuan Gao, +7 more

- 27 May 2024

TL;DR: Vista is a generalizable driving world model with high fidelity and versatile controllability. It addresses limitations in generalization, prediction fidelity, and action controllability through novel losses, latent replacement, and efficient learning strategies.

12

Journal Article•10.48550/arxiv.2410.05954

Pyramidal Flow Matching for Efficient Video Generative Modeling

Yang Jin, +10 more

- 08 Oct 2024

TL;DR: This work introduces Pyramidal Flow Matching, a unified algorithm for efficient video generative modeling, enabling high-quality video generation at 768p resolution and 24 FPS within 20.7k A100 GPU training hours with a single Diffusion Transformer.

8

Journal Article•10.48550/arxiv.2409.00588

Diffusion Policy Policy Optimization

Allen Z. Ren, +8 more

- 31 Aug 2024

TL;DR: DPPO, a novel algorithm, optimizes diffusion-based policies in continuous control and robot learning tasks using policy gradient methods, achieving stronger performance and efficiency than other RL methods, leveraging synergies between RL fine-tuning and diffusion parameterization.

6

Preprint•10.48550/arxiv.2405.15223

iVideoGPT: Interactive VideoGPTs are Scalable World Models

Jing Wu, +6 more

- 24 May 2024

TL;DR: iVideoGPT is a scalable world model framework that integrates visual observations, actions, and rewards into a sequence of tokens, enabling interactive exploration and planning in imagined environments.

5

Journal Article•10.1109/cog60054.2024.10645597

Game Generation via Large Language Models

Chengpeng Hu, +2 more

- 05 Aug 2024

4

References

•Posted Content

Deep Residual Learning for Image Recognition

Kaiming He, +3 more

- 10 Dec 2015

- arXiv: Computer Vision and Pattern Recog...

TL;DR: This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.

117.9K

•Proceedings Article

Attention is All you Need

Ashish Vaswani, +7 more

- 12 Jun 2017

TL;DR: This paper proposes a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.

94.2K

•Posted Content

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, +11 more

- 22 Oct 2020

- arXiv: Computer Vision and Pattern Recog...

TL;DR: Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

36.9K

•Proceedings Article

Language Models are Few-Shot Learners

Tom B. Brown, +30 more

- 28 May 2020

TL;DR: GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.

25.2K

Journal Article•10.48550/arXiv.2204.06125

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, +4 more

- 13 Apr 2022

- arXiv.org

TL;DR: This work proposes a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. It shows that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity.

4.3K
