Text2Human

doi:10.1145/3528223.3530104

Journal Article•10.1145/3528223.3530104

Yuming Jiang, +5 more

- 01 Jul 2022

- ACM Transactions on Graphics

- Vol. 41, Iss: 4, pp 1-11

64

TL;DR: Extensive quantitative and qualitative evaluations demonstrate that the proposed Text2Human framework can generate more diverse and realistic human images compared to state-of-the-art methods.

Abstract: Generating high-quality and diverse human images is an important yet challenging task in vision and graphics. However, existing generative models often fall short under the high diversity of clothing shapes and textures. Furthermore, the generation process should be intuitively controllable for lay users. In this work, we present a text-driven controllable framework, Text2Human, for high-quality and diverse human generation. We synthesize full-body human images starting from a given human pose in two dedicated steps. 1) Given text describing the shapes of clothes, the human pose is first translated to a human parsing map. 2) The final human image is then generated by providing the system with more attributes about the textures of clothes. Specifically, to model the diversity of clothing textures, we build a hierarchical texture-aware codebook that stores multi-scale neural representations for each type of texture. The codebook at the coarse level includes the structural representations of textures, while the codebook at the fine level focuses on the details of textures. To make use of the learned hierarchical codebook to synthesize desired images, a diffusion-based transformer sampler with mixture of experts is first employed to sample indices from the coarsest level of the codebook, which are then used to predict the indices of the codebook at finer levels. The predicted indices at different levels are translated to human images by the decoder learned jointly with the hierarchical codebooks. The use of mixture-of-experts allows the generated image to be conditioned on the fine-grained text input. The prediction of finer-level indices refines the quality of clothing textures. Extensive quantitative and qualitative evaluations demonstrate that our proposed Text2Human framework can generate more diverse and realistic human images compared to state-of-the-art methods.
Our project page is https://yumingj.github.io/projects/Text2Human.html. Code and pretrained models are available at https://github.com/yumingj/Text2Human.
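
The coarse-to-fine codebook idea in the abstract can be illustrated with a toy two-level vector quantizer. This is a minimal sketch, not the authors' implementation: the function names (`quantize`, `encode_hierarchical`, `decode_hierarchical`), the tiny hand-written codebooks, and the choice to let the fine level quantize the residual left by the coarse level are all assumptions made for illustration. The abstract does not specify how the two levels are combined, and the learned decoder and the diffusion-based sampler are omitted entirely.

```python
import numpy as np


def quantize(features, codebook):
    """Return the index of the nearest codebook entry for each feature vector."""
    # Pairwise squared Euclidean distances, shape (num_features, num_codes).
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def encode_hierarchical(features, coarse_codebook, fine_codebook):
    """Map features to (coarse, fine) index pairs.

    Assumption of this sketch: the fine codebook encodes the residual
    that remains after subtracting the selected coarse entry.
    """
    coarse_idx = quantize(features, coarse_codebook)
    residual = features - coarse_codebook[coarse_idx]
    fine_idx = quantize(residual, fine_codebook)
    return coarse_idx, fine_idx


def decode_hierarchical(coarse_idx, fine_idx, coarse_codebook, fine_codebook):
    """Rebuild approximate features from the two index arrays."""
    return coarse_codebook[coarse_idx] + fine_codebook[fine_idx]


# Hypothetical toy codebooks: coarse entries capture structure,
# fine entries capture small corrections.
coarse_codebook = np.array([[0.0, 0.0], [10.0, 10.0]])
fine_codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
features = np.array([[0.9, 0.1], [10.1, 10.8]])

coarse_idx, fine_idx = encode_hierarchical(features, coarse_codebook, fine_codebook)
reconstruction = decode_hierarchical(coarse_idx, fine_idx, coarse_codebook, fine_codebook)

# Squared reconstruction error per feature, with and without the fine level.
coarse_only_error = ((features - coarse_codebook[coarse_idx]) ** 2).sum(axis=1)
two_level_error = ((features - reconstruction) ** 2).sum(axis=1)
```

In this toy setting the fine indices reduce the reconstruction error left by the coarse indices, which mirrors the abstract's claim that predicting finer-level indices refines texture quality. In the actual framework the coarse indices would be sampled by the diffusion-based transformer rather than computed from known features.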

Citations

•Journal Article•10.1109/tpami.2023.3261988

Diffusion Models in Vision: A Survey

01 Jan 2023

- IEEE Transactions on Pattern Analysis an...

TL;DR: Denoising diffusion models are a recent emerging topic in computer vision that has demonstrated remarkable results in generative modeling; they are widely appreciated for the quality and diversity of the generated samples, despite their known computational burdens.

568

•Journal Article•10.1145/3528223.3530094

AvatarCLIP

Fangzhou Hong, +5 more

- 01 Jul 2022

- ACM Transactions on Graphics

TL;DR: The key insight is to take advantage of the powerful vision-language model CLIP for supervising neural human generation, in terms of 3D geometry, texture and animation, to generate unseen 3D avatars with novel animations, achieving superior zero-shot capability.

125

•Journal Article•10.1007/978-3-031-19787-1_1

StyleGAN-Human: A Data-Centric Odyssey of Human Generation

Jianglin Fu, +7 more

- 01 Jan 2022

- Lecture Notes in Computer Science

TL;DR: The authors collected and annotated a large-scale human image dataset with over 230k samples capturing diverse poses and textures, and rigorously investigated three essential factors in data engineering for StyleGAN-based human generation: data size, data distribution, and data alignment.

68

•Journal Article•10.1109/cvpr52729.2023.00589

Collaborative Diffusion for Multi-Modal Face Generation and Editing

Ziqi Huang, +3 more

- 01 Jun 2023

TL;DR: Collaborative Diffusion enables multi-modal face generation and editing by leveraging the synergy between pre-trained uni-modal diffusion models.

57

•Journal Article•10.1109/cvpr52729.2023.01769

Conditional Image-to-Video Generation with Latent Flow Diffusion Models

Haomiao Ni, +4 more

- 01 Jun 2023

TL;DR: Conditional image-to-video generation using latent flow diffusion models (LFDM) generates realistic videos from images and conditions by warping the image in the latent space based on the generated temporally-coherent flow.

41

References

•Proceedings Article•10.1109/CVPR.2017.632

Image-to-Image Translation with Conditional Adversarial Networks

Phillip Isola, +3 more

- 21 Jul 2017

TL;DR: Conditional adversarial networks are investigated as a general-purpose solution to image-to-image translation problems and it is demonstrated that this approach is effective at synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images, among other tasks.

19.6K

•Proceedings Article•10.18653/V1/D19-1410

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Nils Reimers, +1 more

- 14 Aug 2019

TL;DR: This paper presents Sentence-BERT (SBERT), a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity.

12K

•Proceedings Article•10.1109/CVPR.2019.00453

A Style-Based Generator Architecture for Generative Adversarial Networks

Tero Karras, +2 more

- 15 Jun 2019

TL;DR: This paper proposes an alternative generator architecture for GANs, borrowing from style transfer literature, which leads to an automatically learned, unsupervised separation of high-level attributes (e.g., pose and identity when trained on human faces) and stochastic variation in the generated images.

11.7K

•Proceedings Article•10.1109/CVPR.2018.00068

The Unreasonable Effectiveness of Deep Features as a Perceptual Metric

Richard Zhang, +5 more

- 11 Jan 2018

TL;DR: In this paper, the authors introduce a new dataset of human perceptual similarity judgments, and systematically evaluate deep features across different architectures and tasks and compare them with classic metrics, finding that deep features outperform all previous metrics by large margins on their dataset.

8K

•Proceedings Article•10.1109/CVPR.2019.00482

ArcFace: Additive Angular Margin Loss for Deep Face Recognition

Jiankang Deng, +3 more

- 15 Jun 2019

TL;DR: This paper presents arguably the most extensive experimental evaluation against all recent state-of-the-art face recognition methods on ten face recognition benchmarks, and shows that ArcFace consistently outperforms the state of the art and can be easily implemented with negligible computational overhead.

7.5K