Journal Article, DOI: 10.48550/arXiv.2301.13173
Shape-aware Text-driven Layered Video Editing
TL;DR: The authors propagate the deformation field between the input and edited keyframe to all frames and leverage a pre-trained text-conditioned diffusion model as guidance for refining shape distortion and completing unseen regions.
Abstract: Temporal consistency is essential for video editing applications. Existing work on layered representation of videos allows propagating edits consistently to each frame. These methods, however, can only edit object appearance, not object shape, because they rely on a fixed UV mapping field for the texture atlas. We present a shape-aware, text-driven video editing method to tackle this challenge. To handle shape changes in video editing, we first propagate the deformation field between the input and edited keyframes to all frames. We then leverage a pre-trained text-conditioned diffusion model as guidance for refining shape distortion and completing unseen regions. The experimental results demonstrate that our method can achieve shape-aware consistent video editing and compares favorably with the state of the art.
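The propagation step described in the abstract can be pictured with a small sketch. This is not the authors' implementation: the paper works with a layered atlas and a UV mapping, whereas the code below simply assumes a dense per-pixel deformation field is already available for the keyframe and reuses it unchanged on every frame. The function names `warp_frame` and `propagate_field` and the `(dx, dy)` field layout are hypothetical and chosen only for illustration.

```python
import numpy as np


def warp_frame(frame, flow):
    """Backward-warp `frame` (H, W) or (H, W, C) by a dense field `flow` (H, W, 2).

    flow[y, x] = (dx, dy) means output pixel (x, y) is sampled from the
    input location (x + dx, y + dy) using bilinear interpolation.
    """
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)

    # Source coordinates, clamped so samples never leave the image.
    sx = np.clip(xs + flow[..., 0], 0, w - 1)
    sy = np.clip(ys + flow[..., 1], 0, h - 1)

    # Integer corners surrounding each source location.
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)

    # Fractional offsets used as interpolation weights.
    wx = sx - x0
    wy = sy - y0
    if frame.ndim == 3:
        wx = wx[..., None]
        wy = wy[..., None]

    top = frame[y0, x0] * (1 - wx) + frame[y0, x1] * wx
    bottom = frame[y1, x0] * (1 - wx) + frame[y1, x1] * wx
    return top * (1 - wy) + bottom * wy


def propagate_field(key_flow, frames):
    """Naive propagation (assumption): apply the keyframe field to every frame as-is."""
    return [warp_frame(f, key_flow) for f in frames]


if __name__ == "__main__":
    frame = np.arange(12, dtype=float).reshape(3, 4)
    flow = np.zeros((3, 4, 2))
    flow[..., 0] = 1.0  # sample one pixel to the right
    print(warp_frame(frame, flow))
```

In a real layered-video setting the field would have to be remapped per frame through the atlas correspondence, and, as the abstract notes, the warped result would still need refinement for distortion and for regions the input never showed.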
Citations
Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators
Levon Khachatryan,A. Movsisyan,Vahram Tadevosyan,Roberto Henschel,Zhangyang Wang,Sh. Navasardyan,Honghui Shi +6 more
TL;DR: Text2Video-Zero is a low-cost approach that leverages existing text-to-image synthesis methods (e.g., Stable Diffusion) and makes them suitable for the video domain.
FateZero: Fusing Attentions for Zero-shot Text-based Video Editing
TL;DR: FateZero is a zero-shot text-based editing method for real-world videos that requires no per-prompt training or use-specific mask. It captures intermediate attention maps during inversion, which effectively retain both structural and motion information.
TokenFlow: Consistent Diffusion Features for Consistent Video Editing
19 Jul 2023
TL;DR: This work uses a pre-trained text-to-image diffusion model for text-driven video editing, generating a high-quality video that adheres to the target text while preserving the spatial layout and motion of the input video.
State of the Art on Diffusion Models for Visual Computing
Ryan Po,Wang Yifan,Vladislav Golyanik,Kfir Aberman,Jonathan T. Barron,Amit H. Bermano,Eric Ryan Chan,Tali Dekel,Aleksander Holynski,Angjoo Kanazawa,C. K. Liu,Lingjie Liu,Ben Mildenhall,Matthias Nießner,Bjorn Ommer,Christian Theobalt,Peter Wonka,Gordon Wetzstein +17 more
TL;DR: This report introduces the basic mathematical concepts of diffusion models and the implementation details and design choices of the popular Stable Diffusion model, and gives an overview of important aspects of these generative AI tools, including personalization, conditioning, and inversion, among others.
A Survey on Video Diffusion Models
Zhen Xing,Qijun Feng,Haoran Chen,Qi Dai,Hang-Rui Hu,Hang Xu,Zuxuan Wu,Yu-Gang Jiang +7 more
TL;DR: This paper presents a comprehensive review of video diffusion models in the AIGC era, beginning with a concise introduction to the fundamentals and evolution of diffusion models and then giving an overview of research on diffusion models in the video domain.
References
•Posted Content
Denoising Diffusion Probabilistic Models
TL;DR: High quality image synthesis results are presented using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics, which naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding.
•Proceedings Article
Spatial transformer networks
Max Jaderberg,Karen Simonyan,Andrew Zisserman,Koray Kavukcuoglu +3 more
- 07 Dec 2015
TL;DR: This work introduces a new learnable module, the Spatial Transformer, which explicitly allows the spatial manipulation of data within the network, and can be inserted into existing convolutional architectures, giving neural networks the ability to actively spatially transform feature maps.
Principal warps: thin-plate splines and the decomposition of deformations
TL;DR: The decomposition of deformations by principal warps is demonstrated and the method is extended to deal with curving edges between landmarks to aid the extraction of features for analysis, comparison, and diagnosis of biological and medical images.
•Posted Content
Denoising Diffusion Implicit Models
TL;DR: Denoising diffusion implicit models (DDIMs) are presented, a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs that can produce high quality samples faster and perform semantically meaningful image interpolation directly in the latent space.
StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks
Han Zhang,Tao Xu,Hongsheng Li +2 more
- 01 Oct 2017
TL;DR: This paper proposes Stacked Generative Adversarial Networks (StackGAN) to generate 256×256 photo-realistic images conditioned on text descriptions and introduces a novel Conditioning Augmentation technique that encourages smoothness in the latent conditioning manifold.