Embodied Task Planning with Large Language Models

Question

1. How can large language models (LLMs) be leveraged for grounding task plans in physical scenes?

2. How do large pre-trained models impact NLP and computer vision?

3. How is the large-scale multimodal dataset for planning agent generated?

4. How to ground task plan to physical world?

Accepted Answer

Large language models (LLMs) acquire rich commonsense knowledge from vast web data, and embodied agents can use that knowledge to generate action plans for human requirements expressed in natural language. However, LLMs cannot perceive the surrounding scene, so they may generate inexecutable actions that require interacting with nonexistent objects. The plan generated by the LLM must therefore be grounded in the physical world. Previous works filtered or aligned the generated actions using visual clues in the scene for general manipulation of tabletop objects, and others employed visual navigation to collect information in the house for grounded plan generation. These approaches remain limited in diverse deployment scenarios and on complex tasks. The paper presents a task planning agent called TaPA, which generates grounded plans without constraining task types or target objects. TaPA acquires general commonsense knowledge to yield action steps for complex household tasks, providing foundational instructions for downstream navigation and manipulation processes. The agent achieves higher success rates in generating action plans than state-of-the-art LLMs and large multimodal models (LMMs). The contributions include a benchmark for complex embodied task planning in realistic indoor deployment scenarios, a framework for large-scale multimodal dataset generation, and an evaluation of different LLMs and LMMs on complex embodied task planning. An ablation study selects the optimal representation of visual scenes for executable action generation.

Accepted Answer

Large pre-trained models have revolutionized the natural language processing (NLP) and computer vision communities. They benefit from vast training data and numerous parameters, which enable strong generalization across different deployment scenarios. In NLP, LLMs such as LLaMA and GPT-3 perform well on language tasks and demonstrate the ability to acquire factual knowledge about the real world. In computer vision, large models can detect objects, segment scenes, and align visual features with text embeddings. CLIP uses contrastive learning to minimize the distance between matching image-text pairs. LLaVA synthesized a multimodal dataset covering conversation, description, and reasoning, which strengthens the instruction-following ability of LLM-based visual agents. Language model grounding for embodied tasks means obtaining executable action steps from LLMs by supplying information from the environment or by prompt engineering. Researchers have designed language prompts that guide LLMs toward plausible task plans, with visual information extracted from the scene to improve plan plausibility. However, these models currently struggle with complex tasks beyond simple simulations.
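The contrastive objective mentioned for CLIP can be illustrated with a small sketch. This is not CLIP's actual implementation: it is a plain-Python symmetric cross-entropy over cosine similarities, and the function names and the temperature value of 0.07 are assumptions chosen for illustration.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is a matching pair.

    The temperature is an assumed illustrative value.
    """
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j]: scaled cosine similarity between image i and text j.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    # Each image should pick its own text, and each text its own image.
    img_to_txt = sum(_cross_entropy(sim[i], i) for i in range(n)) / n
    txt_to_img = sum(
        _cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (img_to_txt + txt_to_img) / 2
```

The loss is small when each image embedding is closest to its own text embedding and grows when the pairs are mismatched.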

Accepted Answer

The large-scale multimodal dataset for the planning agent is generated by prompting GPT-3.5 with a designed prompt that uses the class names of objects in a 3D scene as the scene representation. The prompt simulates embodied task planning scenarios, so that GPT-3.5 produces instructions together with executable action plans. The generated instructions are diverse, including requests, commands, and queries, and only those with explicitly executable actions are added to the dataset. The object list used in the prompt is derived from the groundtruth labels of the instances in the scene. The dataset is constructed as triplets (X_v, X_q, X_a), where X_v represents the visual scene, X_q the instruction, and X_a the executable action steps. The training phase uses the groundtruth object list, while the inference phase employs an open-vocabulary object detector. The AI2-THOR simulator serves as the embodied environment. Because the initial generation yields 6400 training samples, the dataset is expanded by modifying the groundtruth object list, which increases the scale and diversity of training samples for effective task planner finetuning. The final dataset has 15K samples for training and 60 triplets for evaluation.
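A minimal sketch of the triplet structure and the prompt construction described above follows. The prompt wording, the class and function names, and the drop-and-add expansion scheme are all assumptions made for illustration; the answer does not give the paper's actual prompt or its exact rule for modifying the object list.

```python
import random
from dataclasses import dataclass


@dataclass
class PlanningSample:
    """One (X_v, X_q, X_a) triplet; X_v is held as object class names."""
    scene_objects: list   # X_v: class names of objects in the scene
    instruction: str      # X_q: the human instruction
    action_steps: list    # X_a: executable action steps


def build_generation_prompt(scene_objects):
    """Hypothetical prompt template built from the scene's object list."""
    object_list = ", ".join(scene_objects)
    return (
        "You are an embodied agent in a room containing only these objects: "
        f"[{object_list}]. "
        "Propose one instruction a person might give, then list the "
        "executable action steps that fulfil it using only the listed objects."
    )


def perturb_object_list(scene_objects, candidate_pool, n_drop=1, n_add=1, seed=0):
    """Assumed expansion scheme: drop a few class names and add unseen ones."""
    rng = random.Random(seed)
    kept = list(scene_objects)
    # Never drop the last remaining object.
    for _ in range(min(n_drop, len(kept) - 1)):
        kept.remove(rng.choice(kept))
    # Only add classes that were not in the original scene.
    extras = [c for c in candidate_pool if c not in scene_objects]
    rng.shuffle(extras)
    return kept + extras[:n_add]
```

Each perturbed list would be sent through the same prompt to obtain a new instruction and plan, which is one way the sample count could grow without adding scenes.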

Accepted Answer

To ground the embodied task plan in the physical world under feasibility constraints, the object list of the scene must be obtained accurately, without missing instances or false positives. The approach generalizes an open-vocabulary object detector to acquire the object list. The agent collects RGB images at different locations to perceive the visual scene and discover existing objects. Image collection strategies are designed around location selection criteria: traversal positions, random positions, the overall center point, and block-wise center points. At each selected location, the agent rotates the camera to obtain multi-view images. A grid side length hyperparameter divides the achievable area into grid cells. Clustering methods such as K-means divide the entire scene into sub-regions to improve perception performance, and the block-wise center point strategy visits the centroids of these sub-regions to acquire sufficient visual information. The open-vocabulary object detector processes the collected multi-view RGB images to generate the object list, and duplicate object names are removed from the detection results. The task planning agent (TaPA) combines the perceived object list with the human instruction to generate executable action plans. The block-wise center point strategy is chosen for multi-view RGB image collection, with a grid size of 0.75 and a unit angle for camera rotation of 2π/3.
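The block-wise center point strategy can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the achievable area is assumed to be a rectangle, the K-means routine is a basic Lloyd iteration, the function names are hypothetical, and the detector itself is omitted (its per-image output is taken as a list of class names).

```python
import math
import random


def grid_positions(x_min, x_max, y_min, y_max, side=0.75):
    """Centers of square grid cells covering an assumed rectangular area."""
    nx = max(1, int((x_max - x_min) // side))
    ny = max(1, int((y_max - y_min) // side))
    return [(x_min + (i + 0.5) * side, y_min + (j + 0.5) * side)
            for i in range(nx) for j in range(ny)]


def camera_yaws(unit_angle=2 * math.pi / 3):
    """Yaw angles covering one full turn in steps of unit_angle."""
    n_views = round(2 * math.pi / unit_angle)
    return [k * unit_angle for k in range(n_views)]


def kmeans_centroids(points, k, iters=20, seed=0):
    """Basic Lloyd's K-means; returns the centroids of k sub-regions."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids


def merge_detections(per_image_names):
    """Union of detected class names across views, duplicates removed."""
    merged = {}
    for names in per_image_names:
        for name in names:
            merged.setdefault(name, None)  # dict preserves first-seen order
    return list(merged)
```

With a unit angle of 2π/3 the agent takes three views per location, so the number of images is three times the number of sub-region centroids.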

Accepted Answer

The evaluation metric for generated action plans in this experiment is not explicitly mentioned in the provided text. However, it can be inferred that the metric is used to assess the effectiveness of the TaPA system compared to state-of-the-art LLMs and LMMs. The metric likely measures the quality and accuracy of the action plans generated by the system, taking into account factors such as task completion, efficiency, and adherence to the given visual scenes. The specific details of the evaluation metric would be crucial in understanding the performance and superiority of the TaPA system in embodied task planning. Further research and analysis of the experiment's methodology and results would be necessary to determine the exact evaluation metric used.

Accepted Answer

Action plans are evaluated by 30 researchers who vote on their success. Each plan is assessed by three volunteers, who review the groundtruth object list, the instruction, and the generated action plan and decide whether the action steps can successfully complete the instruction. Failure cases include counterfactuals and hallucinations. A plan counts as successful when at least two of its three volunteers judge that it can be carried out. Unsuccessful cases are annotated with their failure type. The success ratio is reported for each scene type and each plan generation model.
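The voting rule above amounts to a two-of-three majority per plan, aggregated into a success ratio per scene type. A minimal sketch, with hypothetical function names and an assumed record layout of (scene type, three votes):

```python
from collections import defaultdict


def plan_succeeds(votes):
    """A plan succeeds when at least two of its three volunteers vote yes."""
    return sum(bool(v) for v in votes) >= 2


def success_ratio_by_scene(records):
    """records: iterable of (scene_type, votes) pairs, one per evaluated plan."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for scene_type, votes in records:
        totals[scene_type] += 1
        if plan_succeeds(votes):
            wins[scene_type] += 1
    return {scene: wins[scene] / totals[scene] for scene in totals}
```

For example, two kitchen plans with votes (yes, yes, no) and (no, no, yes) give a kitchen success ratio of 0.5.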

Accepted Answer

TaPA achieves the best performance among all evaluated large models in all four scene types: kitchen, living room, bedroom, and bathroom. After instruction finetuning, its average success rate on embodied task planning is 6.38% higher than that of GPT-3.5. In kitchen scenes, where tasks usually require more steps, the performance of current large models is lower. LLaVA has a poor success rate, reflecting the fact that a single image cannot represent the overall scene information in visual question answering tasks. TaPA, embedded with more expert knowledge, has the lowest percentage of counterfactual occurrences and a lower percentage of hallucination cases than LLaMA and GPT-3.5. Different image collection strategies were investigated; for traversal positions with small grid sizes, the success rate was affected by the number of collected images and the computational cost. The success rate in kitchen scenes is the lowest because of the complexity of the tasks and the noise in the object list. Overall, TaPA provides more complete and consistent action steps than LLaMA and GPT-3.5.

Accepted Answer

TaPA generates executable action steps by considering the list of all objects in the scene and using a designed text prompt to tune the instruction model. It collects multi-view RGB images at different achievable locations and uses an open-vocabulary object detection framework to discover the object list of the scene, which is passed to the finetuned instruction model. The training process relies on a multimodal dataset of visual scenes, instructions, and corresponding plans. The generated plans are evaluated for plausibility and checked for hallucinations: each interacting object in the plan must be present in the input object list X_l or be replaced with an alternative item. Exceptional cases, such as an interacting object that is part of an existing object or a synonym of one, are also taken into account. The dataset and evaluation results demonstrate TaPA's superiority over state-of-the-art LLMs and LMMs in generating action plans.
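The hallucination check described above can be sketched as a membership test against the scene's object list, with allowances for synonyms and object parts. In the source this judgment is made during evaluation; the automated version below is only an illustration, and the function name, the synonym map, and the part-of map are hypothetical.

```python
def find_hallucinated_objects(plan_objects, scene_objects,
                              synonyms=None, part_of=None):
    """Return plan objects that cannot be matched to the scene object list.

    synonyms maps an alternative name to a canonical class name, and
    part_of maps a part to the object that contains it. Both maps are
    assumed inputs for this sketch.
    """
    synonyms = synonyms or {}
    part_of = part_of or {}
    present = {name.lower() for name in scene_objects}
    missing = []
    for obj in plan_objects:
        name = obj.lower()
        name = synonyms.get(name, name)   # e.g. "cup" -> "mug"
        whole = part_of.get(name, name)   # e.g. "faucet" -> "sink"
        if name not in present and whole not in present:
            missing.append(obj)
    return missing
```

A plan whose returned list is empty interacts only with objects that exist in the scene, directly, by synonym, or as part of an existing object.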

Most frequently asked questions

1. How can large language models (LLMs) be leveraged for grounding task plans in physical scenes?

2. How do large pre-trained models impact NLP and computer vision?

3. How is the large-scale multimodal dataset for planning agent generated?

4. How to ground task plan to physical world?

5. What is the evaluation metric for generated action plans?

6. How are action plans evaluated by volunteers?

7. How does TaPA perform in embodied task planning compared to LLaMA and GPT-3.5?

8. How does TaPA generate executable action steps?
