TokenFlow: Consistent Diffusion Features for Consistent Video Editing
19 Jul 2023
116
TL;DR: This work presents a framework that uses a text-to-image diffusion model for text-driven video editing: given a source video and a target text prompt, it generates a high-quality video that adheres to the target text while preserving the spatial layout and motion of the input video.
Abstract: The generative AI revolution has recently expanded to videos. Nevertheless, current state-of-the-art video models are still lagging behind image models in terms of visual quality and user control over the generated content. In this work, we present a framework that harnesses the power of a text-to-image diffusion model for the task of text-driven video editing. Specifically, given a source video and a target text-prompt, our method generates a high-quality video that adheres to the target text, while preserving the spatial layout and motion of the input video. Our method is based on a key observation that consistency in the edited video can be obtained by enforcing consistency in the diffusion feature space. We achieve this by explicitly propagating diffusion features based on inter-frame correspondences, readily available in the model. Thus, our framework does not require any training or fine-tuning, and can work in conjunction with any off-the-shelf text-to-image editing method. We demonstrate state-of-the-art editing results on a variety of real-world videos. Webpage: https://diffusion-tokenflow.github.io/
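The propagation idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single keyframe, treats each frame as a flat set of feature tokens, and uses cosine-similarity nearest neighbors as the inter-frame correspondence. The function names `nearest_neighbor_field` and `propagate_features` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def nearest_neighbor_field(src_frame_feats, src_key_feats, eps=1e-8):
    """For each frame token, return the index of the most similar keyframe token.

    Similarity is cosine similarity (an assumption made for this sketch).
    src_frame_feats: (N, D) features of one source frame.
    src_key_feats:   (M, D) features of the source keyframe.
    """
    a = src_frame_feats / (np.linalg.norm(src_frame_feats, axis=1, keepdims=True) + eps)
    b = src_key_feats / (np.linalg.norm(src_key_feats, axis=1, keepdims=True) + eps)
    return np.argmax(a @ b.T, axis=1)  # shape (N,)

def propagate_features(src_frames, src_key, edited_key):
    """Copy edited keyframe tokens to every frame along source-video correspondences.

    src_frames: (T, N, D) source features of all frames.
    src_key:    (M, D) source features of the keyframe.
    edited_key: (M, D) edited features of the keyframe.
    Returns edited features for all frames, shape (T, N, D).
    """
    return np.stack(
        [edited_key[nearest_neighbor_field(frame, src_key)] for frame in src_frames]
    )

# Toy data: each frame is a shuffled copy of the keyframe's tokens.
rng = np.random.default_rng(0)
src_key = rng.normal(size=(6, 4))
edited_key = rng.normal(size=(6, 4))
perms = [rng.permutation(6) for _ in range(3)]
src_frames = np.stack([src_key[p] for p in perms])

edited_frames = propagate_features(src_frames, src_key, edited_key)
```

Because correspondences are computed on the source video and then reused for the edited features, every frame receives the same edited token wherever the source content matches, which is the kind of feature-space consistency the abstract describes.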
Citations
VBench: Comprehensive Benchmark Suite for Video Generative Models
Ziqi Huang,Yinan He,Jiashuo Yu,Fan Zhang,Chenyang Si,Yuming Jiang,Yuanhan Zhang,Tianxing Wu,Qingyang Jin,Nattapol Chanpaisit,Yaohui Wang,Xinyuan Chen,Limin Wang,Dahua Lin,Yu Qiao,Ziwei Liu +15 more
TL;DR: VBench is a comprehensive benchmark suite that dissects video generation quality into specific, hierarchical, and disentangled dimensions, each with tailored prompts and evaluation methods, and provides a dataset of human preference annotations to validate the benchmarks' alignment with human perception.
105
VideoPoet: A Large Language Model for Zero-Shot Video Generation
Dan Kondratyuk,Lijun Yu,Xiuye Gu,José Lezama,Jonathan Huang,Rachel Hornung,Hartwig Adam,Hassan Akbari,Y. Alon,Vighnesh Birodkar,Yong Cheng,Ming-Chang Chiu,Josh Dillon,Irfan Essa,Agrim Gupta,Meera Hahn,Anja Hauth,David Hendon,Alonso Martinez,David Minnen,David A. Ross,Grant Schindler,Mikhail Sirotenko,Kihyuk Sohn,Krishna Somandepalli,Huisheng Wang,Jimmy Yan,Ming Yang,Xuan Yang,Bryan Seybold,Lu Jiang +30 more
TL;DR: Empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation are presented, specifically highlighting VideoPoet's ability to generate high-fidelity motions.
88
Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models
Weifeng Chen,Jie Wu,Pan Xie,Hefeng Wu,Jiashi Li,Xin Xia,Xuefeng Xiao,Liang-Jin Lin +7 more
TL;DR: This paper proposes a controllable text-to-video diffusion model, named Video-ControlNet, that generates videos conditioned on a sequence of control signals, such as edge or depth maps.
76
State of the Art on Diffusion Models for Visual Computing
Ryan Po,Wang Yifan,Vladislav Golyanik,Kfir Aberman,Jonathan T. Barron,Amit H. Bermano,Eric Ryan Chan,Tali Dekel,Aleksander Holynski,Angjoo Kanazawa,C. K. Liu,Lingjie Liu,Ben Mildenhall,Matthias Nießner,Bjorn Ommer,Christian Theobalt,Peter Wonka,Gordon Wetzstein +17 more
TL;DR: This report introduces the basic mathematical concepts of diffusion models and the implementation details and design choices of the popular Stable Diffusion model, and gives an overview of important aspects of these generative AI tools, including personalization, conditioning, and inversion, among others.
A Survey on Video Diffusion Models
Zhen Xing,Qijun Feng,Haoran Chen,Qi Dai,Hang-Rui Hu,Hang Xu,Zuxuan Wu,Yu-Gang Jiang +7 more
TL;DR: This paper presents a comprehensive review of video diffusion models in the AIGC era, beginning with a concise introduction to the fundamentals and evolution of diffusion models and then surveying research on diffusion models in the video domain.
References
•Proceedings Article
Attention is All you Need
Ashish Vaswani,Noam Shazeer,Niki Parmar,Jakob Uszkoreit,Llion Jones,Aidan N. Gomez,Lukasz Kaiser,Illia Polosukhin +7 more
- 12 Jun 2017
TL;DR: This paper proposes a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
U-Net: Convolutional Networks for Biomedical Image Segmentation
Olaf Ronneberger,Philipp Fischer,Thomas Brox +2 more
- 05 Oct 2015
TL;DR: Ronneberger et al. propose a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks.
•Posted Content
Denoising Diffusion Probabilistic Models
TL;DR: High quality image synthesis results are presented using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics, which naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding.
•Posted Content
Deep Unsupervised Learning using Nonequilibrium Thermodynamics
TL;DR: This work develops an approach to systematically and slowly destroy structure in a data distribution through an iterative forward diffusion process, then learns a reverse diffusion process that restores structure in data, yielding a highly flexible and tractable generative model of the data.
Hierarchical Text-Conditional Image Generation with CLIP Latents
TL;DR: This work proposes a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding, and shows that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity.
4.3K