Journal Article, DOI: 10.48550/arxiv.2403.17827
DiffH2O: Diffusion-Based Synthesis of Hand-Object Interactions from Textual Descriptions
Sammy Christen, Shreyas Hampali, Fadime Sener, Edoardo Remelli, Tomas Hodan, Eric Sauser, Shugao Ma, Bugra Tekin +7 more
TL;DR: DiffH2O synthesizes realistic hand-object interactions from textual descriptions, addressing the challenges of generating physically plausible and semantically meaningful motions and generalizing to unseen objects.
Abstract: Generating natural hand-object interactions in 3D is challenging as the resulting hand and object motions are expected to be physically plausible and semantically meaningful. Furthermore, generalization to unseen objects is hindered by the limited scale of available hand-object interaction datasets. We propose DiffH2O, a novel method to synthesize realistic, one- or two-handed object interactions from provided text prompts and geometry of the object. The method introduces three techniques that enable effective learning from limited data. First, we decompose the task into a grasping stage and a text-based interaction stage and use separate diffusion models for each. In the grasping stage, the model only generates hand motions, whereas in the interaction stage both hand and object poses are synthesized. Second, we propose a compact representation that tightly couples hand and object poses. Third, we propose two different guidance schemes to allow more control of the generated motions: grasp guidance and detailed textual guidance. Grasp guidance takes a single target grasping pose and guides the diffusion model to reach this grasp at the end of the grasping stage, which provides control over the grasping pose. Given a grasping motion from this stage, multiple different actions can be prompted in the interaction stage. For textual guidance, we contribute comprehensive text descriptions to the GRAB dataset and show that they enable our method to have more fine-grained control over hand-object interactions. Our quantitative and qualitative evaluation demonstrates that the proposed method outperforms baseline methods and leads to natural hand-object motions. Moreover, we demonstrate the practicality of our framework by utilizing a hand pose estimate from an off-the-shelf pose estimator for guidance, and then sampling multiple different actions in the interaction stage.
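The grasp guidance idea in the abstract (steer sampling so the motion reaches a given grasp pose at the end of the grasping stage) can be illustrated with a toy loop. This is only a sketch under stated assumptions, not the authors' formulation, which the abstract does not spell out: `toy_denoiser` is a hypothetical stand-in for a trained network (it merely smooths over time), the guidance is a simple blend of the final frame toward the target, and the linear noise schedule, pose dimension, and function names are all made up for illustration.

```python
import random


def toy_denoiser(motion):
    """Hypothetical stand-in for a trained denoising network.

    It averages each frame with its temporal neighbours, so constraints
    placed on one frame spread to nearby frames over repeated steps.
    """
    num_frames, dim = len(motion), len(motion[0])
    out = []
    for k in range(num_frames):
        window = motion[max(0, k - 1):min(num_frames, k + 2)]
        out.append([sum(f[d] for f in window) / len(window) for d in range(dim)])
    return out


def apply_grasp_guidance(motion, target_pose, strength=1.0):
    """Blend the final frame toward the target grasp pose (assumed guidance rule)."""
    guided = [frame[:] for frame in motion]
    guided[-1] = [(1.0 - strength) * v + strength * g
                  for v, g in zip(guided[-1], target_pose)]
    return guided


def sample_grasp_motion(target_pose, num_frames=8, num_steps=20, strength=1.0, seed=0):
    """Toy sampling loop: denoise, guide the estimate, then re-noise at a lower level."""
    rng = random.Random(seed)
    dim = len(target_pose)
    x = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_frames)]
    for step in reversed(range(num_steps)):
        x0 = toy_denoiser(x)                                 # estimate of the clean motion
        x0 = apply_grasp_guidance(x0, target_pose, strength)
        sigma = step / num_steps                             # assumed linear schedule, 0 at the last step
        x = [[v + sigma * rng.gauss(0.0, 1.0) for v in frame] for frame in x0]
    return x


target = [0.5, -0.2, 1.0]            # hypothetical 3-D "grasp pose"
motion = sample_grasp_motion(target)
print(motion[-1])                    # final frame of the generated sequence
```

With `strength=1.0` the last frame is overwritten by the target at every step, so the sampled sequence ends at the requested pose; smaller values only pull it part of the way. In a two-stage setup like the one described, the resulting grasping motion would then be handed to a separate interaction-stage model.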
Figures
![Table 1](/figures/table1-1-6c2otonk5y83.png)
Table 1. Comparison to the State-of-the-Art in Postgrasp. We compare our method with two backbone variants (MDM’s transformer and GMD’s UNet) against IMoS [Ghosh et al. 2023]. We also include real motion capture sequences from GRAB [Taheri et al. 2020] as reference. We present action recognition results using only hands as input, and using hands and objects in combination. We report results on an unseen subject split (top 4 rows), following [Ghosh et al. 2023], and on our unseen object test dataset (bottom 4 rows). For IMoS, we use the same pretrained model, which is trained on unseen subject split, across all our experiments (unseen subject/object splits). This is due to difficulties in reproducing training performance for the unseen object split, and is indicated with a * in the table (IMoS*). ↑ denotes higher values are better, ↓ denotes lower values are better, and → denotes values closer to ground-truth are better.
![Fig. 5](/figures/figure5-1-l8ho9p66culc.png)
Fig. 5. Overview of the diffusion architecture. Our pipeline relies on a UNet block and processes three input signals: the time step 𝜙(𝑡), a text-prompt embedding T, and an object shape encoding M. The time step is encoded using sinusoidal functions, the text-prompt embedding is generated by the CLIP text encoder model, and the object encoding is obtained from BPS [Prokudin et al. 2019]. Similarly to [Karunratanakul et al. 2023], we use Adaptive Group normalization in 1D block.
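Two of the conditioning signals named in the caption have simple closed forms and can be sketched in a few lines. The following is a minimal illustration, not the paper's code: the embedding size, the `max_period` constant, the number of basis points, and all function names are assumptions. The BPS part follows the general basis-point-set idea (for each fixed basis point, the distance to the nearest point of the object cloud) and assumes the cloud has already been normalized to the unit ball.

```python
import math
import random


def sinusoidal_embedding(t, dim, max_period=10000.0):
    """Encode a scalar diffusion time step as `dim` sine/cosine features."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    # Geometrically spaced frequencies, starting at 1 and decaying toward 1/max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def sample_basis(num_points, seed=0):
    """Draw a fixed set of 3-D basis points inside the unit ball (rejection sampling)."""
    rng = random.Random(seed)
    basis = []
    while len(basis) < num_points:
        p = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        if sum(c * c for c in p) <= 1.0:
            basis.append(p)
    return basis


def bps_encode(points, basis):
    """For each basis point, record the distance to its nearest point in the cloud."""
    return [min(math.dist(b, p) for p in points) for b in basis]


emb = sinusoidal_embedding(t=10, dim=8)
basis = sample_basis(16)
cloud = [[0.1, 0.0, 0.0], [0.0, 0.2, 0.0], [0.0, 0.0, -0.3]]  # toy object point cloud
shape_code = bps_encode(cloud, basis)
print(len(emb), len(shape_code))
```

Because the basis is fixed once and reused for every object, the shape code has the same length regardless of how many points the object has, which is what makes it convenient as a network input.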
Table 6. Network architecture. Model and training hyperparameters of DiffH2O.
Table 2. Comparison to Diffusion Baselines for the Full Sequence.
![Fig. 3](/figures/figure3-1-3j57fyqho09g.png)
Fig. 3. Qualitative Comparison. Post-optimizing object motion as shown in IMoS [Ghosh et al. 2023] (bottom row) can exhibit artifacts with fine-grained manipulations, e.g., when an object switches hands. In contrast, our approach (top row) can seamlessly handle such scenarios. Best seen in the supplemental video.
Fig. 6. Qualitative Examples. We provide more qualitative examples with a) standard generation without any guidance, b) grasp guidance, and c) our model trained with detailed text descriptions.
Citations
GenHeld: Generating and Editing Handheld Objects
Chun-jia Min, Srinath Sridhar +1 more
- 07 Jun 2024
TL;DR: GenHeld generates and edits handheld objects from 3D hand models or 2D images. It selects objects based on hand model or image, positions and orientates them for a plausible grasp, and edits images to add or replace held objects.
RegionGrasp: A Novel Task for Contact Region Controllable Hand Grasp Generation
Sheng Wang, Chuan Guo, Li Cheng, Hai Jiang +3 more
- 10 Oct 2024
TL;DR: RegionGrasp proposes a novel task for generating diverse hand grasps given a 3D object and specific contact region, introducing RegionGrasp-CVAE with ConditionNet and HOINet to enable contact region-awareness and interaction awareness.
Multi-Modal Diffusion for Hand-Object Grasp Generation
Jinkun Cao, Jingyuan Liu, Kris Kitani, Yi Zhou +3 more
- 06 Sep 2024
TL;DR: This work proposes Multi-modal Grasp Diffusion (MGD), a single model that generalizes hand and object shapes, generating hand grasp from heterogeneous data sources, achieving good visual plausibility and diversity in both conditional and unconditional generation.
References
Denoising Diffusion Probabilistic Models
TL;DR: High quality image synthesis results are presented using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics, which naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding.
Embodied hands: modeling and capturing hands and bodies together
TL;DR: A model of hands and bodies interacting together is formulated and fit to full-body 4D sequences, producing results that move naturally with detailed hand motions and a realism not seen before in full-body performance capture.
Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations
Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, Sergey Levine +6 more
- 26 Jun 2018
TL;DR: We show that model-free DRL with natural policy gradients can effectively scale up to complex manipulation tasks with a high-dimensional 24-DoF hand and solve them from scratch in simulated experiments.
DiffWave: A Versatile Diffusion Model for Audio Synthesis
TL;DR: DiffWave significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task in terms of audio quality and sample diversity from various automatic and human evaluations.
On the Continuity of Rotation Representations in Neural Networks
TL;DR: A definition of a continuous representation is advanced that is helpful for training deep neural networks and is related to topological concepts such as homeomorphism and embedding; results show that continuous rotation representations outperform discontinuous ones for several practical problems in graphics and vision.