Journal Article
DOI: 10.48550/arxiv.2404.07987
ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback
Ming Hui Li, Taojiannan Yang, Huafeng Kuang, Jie Wu, Zhaoning Wang, Xuefeng Xiao, Chen Chen
TL;DR: ControlNet++ improves controllability of text-to-image diffusion models by explicitly optimizing pixel-level cycle consistency between generated images and conditional controls.
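The cycle-consistency idea can be illustrated with a toy example. This is a hedged sketch, not the authors' code: it assumes a segmentation-mask condition scored with a per-pixel cross-entropy loss, flattens images and masks to short Python lists, and uses `toy_reward_model` as a hypothetical stand-in for the pre-trained discriminative reward model.

```python
import math
from typing import Callable, List, Sequence


# Hypothetical stand-in for a pre-trained discriminative reward model
# (e.g. a segmentation network). Each "pixel" is a float in [0, 1] and the
# model returns per-pixel probabilities for two classes: [p(class 0), p(class 1)].
def toy_reward_model(image: Sequence[float]) -> List[List[float]]:
    return [[1.0 - p, p] for p in image]


def cycle_consistency_loss(
    condition: Sequence[int],
    generated_image: Sequence[float],
    reward_model: Callable[[Sequence[float]], List[List[float]]],
    eps: float = 1e-8,
) -> float:
    """Mean per-pixel cross-entropy between the input condition (class labels)
    and the condition the reward model extracts from the generated image."""
    probs = reward_model(generated_image)
    if len(probs) != len(condition):
        raise ValueError("condition and image must have the same number of pixels")
    total = 0.0
    for label, pixel_probs in zip(condition, probs):
        total -= math.log(pixel_probs[label] + eps)
    return total / len(condition)


mask = [0, 1, 1, 0]
aligned = [0.1, 0.9, 0.8, 0.2]     # image that mostly respects the mask
misaligned = [0.9, 0.1, 0.2, 0.8]  # image that contradicts the mask
print(cycle_consistency_loss(mask, aligned, toy_reward_model))
print(cycle_consistency_loss(mask, misaligned, toy_reward_model))
```

The aligned image yields the lower loss; in actual fine-tuning this loss would be back-propagated into the generator so that conditions extracted from its outputs move toward the input control.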
Abstract: To enhance the controllability of text-to-image diffusion models, existing efforts like ControlNet incorporated image-based conditional controls. In this paper, we reveal that existing methods still face significant challenges in generating images that align with the image conditional controls. To this end, we propose ControlNet++, a novel approach that improves controllable generation by explicitly optimizing pixel-level cycle consistency between generated images and conditional controls. Specifically, for an input conditional control, we use a pre-trained discriminative reward model to extract the corresponding condition from the generated images, and then optimize the consistency loss between the input conditional control and the extracted condition. A straightforward implementation would be to generate images from random noise and then calculate the consistency loss, but such an approach requires storing gradients for multiple sampling timesteps, leading to considerable time and memory costs. To address this, we introduce an efficient reward strategy that deliberately disturbs the input images by adding noise, and then uses the single-step denoised images for reward fine-tuning. This avoids the extensive costs associated with image sampling, allowing for more efficient reward fine-tuning. Extensive experiments show that ControlNet++ significantly improves controllability under various conditional controls. For example, it achieves improvements over ControlNet by 7.9% mIoU, 13.4% SSIM, and 7.6% RMSE, respectively, for segmentation mask, line-art edge, and depth conditions.
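The efficient reward strategy can be sketched without a deep-learning framework. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a DDPM-style linear noise schedule (beta from 1e-4 to 0.02 over 1,000 steps), and `efficient_reward_loss`, `mse_reward_loss`, and `oracle_predictor` are hypothetical names. In real fine-tuning the noise predictor would be the ControlNet-conditioned denoiser, the reward model a frozen pre-trained network, and gradients would flow through this single denoising step only, which is what avoids storing gradients across a full sampling chain. The small timestep (t = 150) is an assumption, chosen because a one-step estimate from a lightly noised image is sharp enough for a reward model to read.

```python
import math
import random
from typing import List, Sequence

Vector = List[float]


def linear_alpha_bars(num_steps: int = 1000, beta_start: float = 1e-4,
                      beta_end: float = 0.02) -> Vector:
    """Cumulative products alpha_bar_t of an (assumed) DDPM-style linear beta schedule."""
    alpha_bars: Vector = []
    prod = 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def add_noise(x0: Sequence[float], noise: Sequence[float], alpha_bar: float) -> Vector:
    """Disturb a clean input: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * n for x, n in zip(x0, noise)]


def single_step_x0(x_t: Sequence[float], eps_hat: Sequence[float], alpha_bar: float) -> Vector:
    """One-step denoised estimate: x0_hat = (x_t - sqrt(1 - alpha_bar) * eps_hat) / sqrt(alpha_bar)."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - b * e) / a for x, e in zip(x_t, eps_hat)]


def efficient_reward_loss(x0, condition, noise_predictor, reward_loss,
                          alpha_bars: Vector, t: int, rng: random.Random) -> float:
    """Noise a real training image, denoise it in a single step, and score the result."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = add_noise(x0, noise, alpha_bars[t])
    eps_hat = noise_predictor(x_t, t, condition)  # one forward pass, no sampling loop
    x0_hat = single_step_x0(x_t, eps_hat, alpha_bars[t])
    return reward_loss(condition, x0_hat)


def mse_reward_loss(condition: Sequence[float], x0_hat: Sequence[float]) -> float:
    # Hypothetical reward model: identity, i.e. the toy "image" is read directly as a depth map.
    return sum((c - p) ** 2 for c, p in zip(condition, x0_hat)) / len(condition)


alpha_bars = linear_alpha_bars()
x0 = [0.2, 0.4, 0.6, 0.8]     # toy training image
depth = [0.2, 0.4, 0.6, 0.8]  # its toy depth condition


def oracle_predictor(x_t, t, condition):
    # Hypothetical stand-in for the conditioned denoiser: it knows x0 and returns the exact noise.
    a, b = math.sqrt(alpha_bars[t]), math.sqrt(1.0 - alpha_bars[t])
    return [(x - a * clean) / b for x, clean in zip(x_t, x0)]


loss = efficient_reward_loss(x0, depth, oracle_predictor, mse_reward_loss,
                             alpha_bars, t=150, rng=random.Random(0))
print(loss)  # close to 0: a perfect noise prediction recovers x0 in one step
```

With an imperfect predictor, `x0_hat` drifts from the clean image and the reward loss grows, which is the signal the reward fine-tuning pushes against.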
Citations
OmniGen: Unified Image Generation
Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Rui Yan, Shuting Wang, Tiejun Huang, Zheng Liu
- 17 Sep 2024
TL;DR: OmniGen, a unified diffusion model, integrates text-to-image generation with image editing, subject-driven generation, and visual-conditional generation, eliminating the need for additional modules and simplifying the workflow through unified knowledge transfer and a chain-of-thought mechanism.
ControlAR: Controllable Image Generation with Autoregressive Models
Zongming Li, Tianheng Cheng, Shoufa Chen, Peize Sun, Haocheng Shen, Longjin Ran, Xiaoxin Chen, Wenyu Liu, Xinggang Wang
TL;DR: ControlAR introduces a framework for integrating spatial controls into autoregressive image generation models, enabling efficient and effective control-to-image generation with conditional decoding, surpassing state-of-the-art controllable diffusion models in controllability and image quality.
FASTER: Face Attribute Sliders with Semantic Rewards
Jingyan Chen, Lanxiang Zhou, Han Fang, Zerun Feng, Chao Ban, Yaqi Li
- 06 Apr 2025
TL;DR: FASTER proposes a method for face attribute editing using Stable Diffusion models, achieving 98.67% editing accuracy and 10% improved attribute preservation on CelebA-HQ, with a 6x reduction in training time through efficient one-step reward learning.
EEG-driven natural image reconstruction with regional semantic awareness
Xin Xiang, Wenhui Zhou, Haonan Zhu, Yunrui Li, Guojun Dai, Lili Lin
On mitigating stability-plasticity dilemma in CLIP-guided image morphing via geodesic distillation loss
Yeongtak Oh, Saehyung Lee, Uiwon Hwang, Sungroh Yoon
- 01 Jan 2024
TL;DR: Mitigating the stability-plasticity dilemma in CLIP-guided image morphing with a geodesic distillation loss yields superior morphing results on images and videos.