Journal Article, DOI: 10.1145/3528223.3530104
Text2Human
TL;DR: Extensive quantitative and qualitative evaluations demonstrate that the proposed Text2Human framework can generate more diverse and realistic human images compared to state-of-the-art methods.
Abstract: Generating high-quality and diverse human images is an important yet challenging task in vision and graphics. However, existing generative models often fall short given the high diversity of clothing shapes and textures. Furthermore, the generation process should be intuitively controllable for lay users. In this work, we present a text-driven controllable framework, Text2Human, for high-quality and diverse human generation. We synthesize full-body human images starting from a given human pose in two dedicated steps. 1) Given text describing the shapes of clothes, the human pose is first translated to a human parsing map. 2) The final human image is then generated by providing the system with further attributes describing the textures of clothes. Specifically, to model the diversity of clothing textures, we build a hierarchical texture-aware codebook that stores multi-scale neural representations for each type of texture. The codebook at the coarse level includes the structural representations of textures, while the codebook at the fine level focuses on the details of textures. To make use of the learned hierarchical codebook to synthesize desired images, a diffusion-based transformer sampler with mixture of experts is first employed to sample indices from the coarsest level of the codebook; these indices are then used to predict the indices of the codebook at finer levels. The predicted indices at different levels are translated to human images by the decoder learned jointly with the hierarchical codebooks. The use of mixture-of-experts allows the generated image to be conditioned on the fine-grained text input. The prediction of finer-level indices refines the quality of clothing textures. Extensive quantitative and qualitative evaluations demonstrate that our proposed Text2Human framework can generate more diverse and realistic human images than state-of-the-art methods.
Our project page is https://yumingj.github.io/projects/Text2Human.html. Code and pretrained models are available at https://github.com/yumingj/Text2Human.
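The coarse-to-fine codebook idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it swaps in plain two-level residual vector quantization on toy 2-D vectors as a stand-in, and the function names (`nearest_index`, `quantize`, `hierarchical_quantize`) and codebook values are hypothetical. In the paper, the codebooks are learned jointly with a decoder and the indices are sampled by a transformer rather than looked up from features.

```python
def nearest_index(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def quantize(features, codebook):
    """Map each feature vector to the index and value of its nearest code."""
    indices = [nearest_index(f, codebook) for f in features]
    return indices, [codebook[i] for i in indices]


def hierarchical_quantize(features, coarse_codebook, fine_codebook):
    """Two-level quantization (assumed residual scheme, for illustration only).

    The coarse level captures the overall structure of each feature; the fine
    level encodes what the coarse code missed (the residual detail).
    """
    coarse_idx, coarse_vecs = quantize(features, coarse_codebook)
    residuals = [
        [x - c for x, c in zip(f, cv)] for f, cv in zip(features, coarse_vecs)
    ]
    fine_idx, fine_vecs = quantize(residuals, fine_codebook)
    # Reconstruction: coarse code plus fine correction.
    recon = [
        [c + r for c, r in zip(cv, fv)] for cv, fv in zip(coarse_vecs, fine_vecs)
    ]
    return coarse_idx, fine_idx, recon


# Toy codebooks (hypothetical values).
coarse_codebook = [[0.0, 0.0], [1.0, 1.0]]
fine_codebook = [[0.0, 0.0], [0.1, -0.1], [-0.1, 0.1]]
features = [[1.1, 0.9], [0.0, 0.2]]

coarse_idx, fine_idx, recon = hierarchical_quantize(
    features, coarse_codebook, fine_codebook
)
print(coarse_idx, fine_idx)
```

In this toy setting, an image is represented only by the two index lists; a generator that predicts coarse indices first and fine indices second mirrors the sampling order the abstract describes.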
Citations
Diffusion Models in Vision: A Survey
TL;DR: Denoising diffusion models are a recently emerging topic in computer vision that has demonstrated remarkable results in generative modeling; they are widely appreciated for the quality and diversity of their generated samples, despite their known computational burdens.
AvatarCLIP
TL;DR: The key insight is to take advantage of the powerful vision-language model CLIP for supervising neural human generation, in terms of 3D geometry, texture and animation, to generate unseen 3D avatars with novel animations, achieving superior zero-shot capability.
StyleGAN-Human: A Data-Centric Odyssey of Human Generation
TL;DR: The authors collected and annotated a large-scale human image dataset with over 230k samples capturing diverse poses and textures, and rigorously investigated three essential factors in data engineering for StyleGAN-based human generation, namely data size, data distribution, and data alignment.
Collaborative Diffusion for Multi-Modal Face Generation and Editing
Ziqi Huang, Kelvin C. K. Chan, Yuming Jiang, Ziwei Liu
- 01 Jun 2023
TL;DR: Collaborative Diffusion enables multi-modal face generation and editing by leveraging the synergy between pre-trained uni-modal diffusion models.
Conditional Image-to-Video Generation with Latent Flow Diffusion Models
Haomiao Ni, Changhao Shi, Kai Li, Sharon X. Huang, Martin Renqiang Min
- 01 Jun 2023
TL;DR: Conditional image-to-video generation using latent flow diffusion models (LFDM) generates realistic videos from images and conditions by warping the image in the latent space based on the generated temporally-coherent flow.
References
Image-to-Image Translation with Conditional Adversarial Networks
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros
- 21 Jul 2017
TL;DR: Conditional adversarial networks are investigated as a general-purpose solution to image-to-image translation problems and it is demonstrated that this approach is effective at synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images, among other tasks.
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Nils Reimers, Iryna Gurevych
- 14 Aug 2019
TL;DR: This paper presents Sentence-BERT (SBERT), a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity.
A Style-Based Generator Architecture for Generative Adversarial Networks
Tero Karras, Samuli Laine, Timo Aila
- 15 Jun 2019
TL;DR: This paper proposed an alternative generator architecture for GANs, borrowing from style transfer literature, which leads to an automatically learned, unsupervised separation of high-level attributes (e.g., pose and identity when trained on human faces) and stochastic variation in the generated images.
The Unreasonable Effectiveness of Deep Features as a Perceptual Metric
Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, Oliver Wang
- 11 Jan 2018
TL;DR: In this paper, the authors introduce a new dataset of human perceptual similarity judgments, and systematically evaluate deep features across different architectures and tasks and compare them with classic metrics, finding that deep features outperform all previous metrics by large margins on their dataset.
ArcFace: Additive Angular Margin Loss for Deep Face Recognition
Jiankang Deng, Jia Guo, Niannan Xue, Stefanos Zafeiriou
- 15 Jun 2019
TL;DR: This paper presents arguably the most extensive experimental evaluation against all recent state-of-the-art face recognition methods on ten face recognition benchmarks, and shows that ArcFace consistently outperforms the state of the art and can be easily implemented with negligible computational overhead.