ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback

doi:10.48550/arxiv.2404.07987

Journal Article

Ming Hui Li, +6 more

- 11 Apr 2024

- arXiv.org

- Vol. abs/2404.07987

16

TL;DR: ControlNet++ improves controllability of text-to-image diffusion models by explicitly optimizing pixel-level cycle consistency between generated images and conditional controls.

Abstract: To enhance the controllability of text-to-image diffusion models, existing efforts like ControlNet incorporate image-based conditional controls. In this paper, we reveal that existing methods still face significant challenges in generating images that align with the image conditional controls. To this end, we propose ControlNet++, a novel approach that improves controllable generation by explicitly optimizing pixel-level cycle consistency between generated images and conditional controls. Specifically, for an input conditional control, we use a pre-trained discriminative reward model to extract the corresponding condition from the generated images, and then optimize the consistency loss between the input conditional control and the extracted condition. A straightforward implementation would generate images from random noise and then calculate the consistency loss, but such an approach requires storing gradients for multiple sampling timesteps, leading to considerable time and memory costs. To address this, we introduce an efficient reward strategy that deliberately disturbs the input images by adding noise, and then uses the single-step denoised images for reward fine-tuning. This avoids the extensive costs associated with image sampling, allowing for more efficient reward fine-tuning. Extensive experiments show that ControlNet++ significantly improves controllability under various conditional controls. For example, it achieves improvements over ControlNet of 7.9% mIoU, 13.4% SSIM, and 7.6% RMSE for segmentation mask, line-art edge, and depth conditions, respectively.
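The efficient reward strategy described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the image is a flat list of floats, `toy_reward_model` is a hypothetical stand-in for the pre-trained discriminative model, and the noise level `alpha_bar_t` is an assumed value. The noising and single-step recovery formulas are the standard DDPM ones. In real training the noise prediction would come from the diffusion network and the loss would be backpropagated through that single step.

```python
import math
import random

def add_noise(x0, eps, alpha_bar_t):
    """Forward diffusion: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, eps)]

def one_step_denoise(x_t, eps_pred, alpha_bar_t):
    """Single-step estimate of the clean image from a noise prediction."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [(x - b * e) / a for x, e in zip(x_t, eps_pred)]

def toy_reward_model(image, sharpness=10.0):
    """Hypothetical condition extractor: a soft foreground mask per pixel."""
    return [1.0 / (1.0 + math.exp(-sharpness * (p - 0.5))) for p in image]

def consistency_loss(cond_in, cond_out):
    """Mean squared error between input and extracted conditions."""
    return sum((a - b) ** 2 for a, b in zip(cond_in, cond_out)) / len(cond_in)

random.seed(0)
x0 = [0.1, 0.9, 0.2, 0.8]            # toy "training image"
condition = toy_reward_model(x0)      # input conditional control
eps = [random.gauss(0.0, 1.0) for _ in x0]
alpha_bar_t = 0.9                     # assumed small noise level

# Disturb the input image instead of sampling from pure noise.
x_t = add_noise(x0, eps, alpha_bar_t)

# A perfect noise prediction recovers x0, so the loss is ~0.
x0_hat = one_step_denoise(x_t, eps, alpha_bar_t)
loss_perfect = consistency_loss(condition, toy_reward_model(x0_hat))

# A poor prediction (all zeros) leaves residual noise and a larger loss.
x0_bad = one_step_denoise(x_t, [0.0] * len(x0), alpha_bar_t)
loss_bad = consistency_loss(condition, toy_reward_model(x0_bad))

print(f"perfect predictor loss: {loss_perfect:.3e}")
print(f"zero predictor loss:    {loss_bad:.3e}")
```

Because only one denoising step sits between the noised input and the reward model, gradients need to be stored for a single timestep rather than for a full sampling trajectory, which is the cost saving the abstract describes.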

Citations

Journal Article•10.48550/arxiv.2409.11340

OmniGen: Unified Image Generation

Shitao Xiao, +8 more

- 17 Sep 2024

TL;DR: OmniGen, a unified diffusion model, integrates text-to-image generation with image editing, subject-driven generation, and visual-conditional generation, eliminating the need for additional modules and simplifying the workflow through unified knowledge transfer and a chain-of-thought mechanism.

4

Journal Article•10.48550/arxiv.2410.02705

ControlAR: Controllable Image Generation with Autoregressive Models

Zongming Li, +8 more

- 03 Oct 2024

- arXiv.org

TL;DR: ControlAR introduces a framework for integrating spatial controls into autoregressive image generation models, enabling efficient and effective control-to-image generation with conditional decoding, surpassing state-of-the-art controllable diffusion models in controllability and image quality.

1

Proceedings Article•10.1109/icassp49660.2025.10889494

FASTER: Face Attribute Sliders with Semantic Rewards

Jingyan Chen, +5 more

- 06 Apr 2025

TL;DR: FASTER proposes a method for face attribute editing using stable diffusion models, achieving 98.67% editing accuracy and 10% improved attribute preservation on CelebA-HQ, with a 6x reduction in training time through efficient one-step reward learning.

Journal Article•10.1016/j.patcog.2025.112589

EEG-driven natural image reconstruction with regional semantic awareness

Xin Xiang, +5 more

- 11 Oct 2025

- Pattern Recognition

Preprint•10.48550/arxiv.2401.10526

On mitigating stability-plasticity dilemma in CLIP-guided image morphing via geodesic distillation loss

Yeongtak Oh, +3 more

- 01 Jan 2024

TL;DR: Mitigating the stability-plasticity dilemma in CLIP-guided image morphing via a geodesic distillation loss achieves superior morphing results on images and videos.
