1. How can large language models (LLMs) be leveraged for grounding task plans in physical scenes?
LLMs acquire rich commonsense knowledge from vast web data, and embodied agents can use that knowledge to generate action plans for human requirements expressed in natural language. However, LLMs cannot perceive the surrounding scene, so they may generate inexecutable actions that require interacting with nonexistent objects. Grounding the LLM-generated task plan in the physical world is therefore necessary. Previous works filter or align the generated actions using visual clues in the scene for general manipulation of tabletop objects, and others employ visual navigation to collect information in a house for grounded plan generation. These approaches remain limited in diverse deployment scenarios and complex tasks. This paper presents a task planning agent called TaPA, which generates grounded plans without constraining task types or target objects. TaPA draws on general commonsense knowledge to yield action steps for complex household tasks, providing foundational instructions for downstream navigation and manipulation. The agent achieves higher success rates in generating action plans than state-of-the-art LLMs and large multimodal models (LMMs). The contributions include a benchmark for complex embodied task planning in realistic indoor deployment scenarios, a framework for large-scale multimodal dataset generation, and an evaluation of different LLMs and LMMs on complex embodied task planning. An ablation study selects the optimal representation of visual scenes for executable action generation.
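To make the failure mode concrete, the following minimal sketch flags plan steps that refer to objects absent from the scene. It is an illustration only, not TaPA's method: the function name `ungrounded_steps` and the representation of a plan as (action text, required objects) pairs are assumptions made for this example.

```python
def ungrounded_steps(plan, scene_objects):
    """Return (action, missing_objects) pairs for steps that need objects
    that are not present in the scene.

    `plan` is a list of (action_text, required_object_names) pairs. This
    representation is hypothetical and chosen only for illustration.
    """
    present = {name.lower() for name in scene_objects}
    problems = []
    for action, required in plan:
        # Compare case-insensitively so "Mug" and "mug" count as the same object.
        missing = [obj for obj in required if obj.lower() not in present]
        if missing:
            problems.append((action, missing))
    return problems


if __name__ == "__main__":
    plan = [
        ("Pick up the mug", ["Mug"]),
        ("Put the mug in the microwave", ["Mug", "Microwave"]),
    ]
    for action, missing in ungrounded_steps(plan, ["Mug", "Sink"]):
        print(f"Inexecutable: {action!r} needs {missing}")
```

A plan that passes this check is not guaranteed to be executable; the check only catches references to objects that the scene does not contain.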
2. How do large pre-trained models impact NLP and computer vision?
Large pre-trained models have revolutionized the natural language processing (NLP) and computer vision communities. They benefit from vast training data and large parameter counts, which enable strong generalization across different deployment scenarios. In NLP, LLMs such as LLaMA and GPT-3 perform well across tasks and demonstrate the ability to acquire factual knowledge about the real world. In computer vision, large models can detect objects, segment scenes, and align visual features with text embeddings. CLIP uses contrastive learning to minimize the distance between the embeddings of matching image-text pairs. LLaVA synthesized a multimodal dataset covering conversation, description, and reasoning to strengthen the instruction-following capabilities of LLM-based visual agents. Language model grounding for embodied tasks means generating executable action steps from LLMs that are grounded with information from the environment or through prompt engineering. Researchers have designed language prompts that guide LLMs toward plausible task plans, with visual information extracted from the scene to improve plan plausibility. However, these models currently struggle with complex tasks beyond simple simulations.
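The contrastive idea mentioned for CLIP can be sketched as a symmetric cross-entropy over cosine similarities, where the i-th image and the i-th text in a batch form the matching pair. This is a generic pure-Python illustration of an image-text contrastive objective, not CLIP's actual implementation; the function names and the fixed temperature of 0.07 are assumptions for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cross_entropy(logits, target_index):
    """Negative log-softmax of the target entry, computed stably."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[target_index]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i of images and texts is the positive."""
    n = len(image_embs)
    # logits[i][j] compares image i with text j.
    logits = [
        [cosine_similarity(img, txt) / temperature for txt in text_embs]
        for img in image_embs
    ]
    # Image-to-text: each row should peak on its diagonal entry.
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image: each column should peak on its diagonal entry.
    text_to_image = sum(
        cross_entropy([logits[r][j] for r in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)
```

Minimizing this loss pulls matching image and text embeddings together and pushes mismatched pairs apart, which is what allows visual features to be compared directly with text embeddings.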
3. How is the large-scale multimodal dataset for the planning agent generated?
The large-scale multimodal dataset for the planning agent is generated by prompting GPT-3.5 with a designed prompt that represents a 3D scene by the class names of its objects. The prompt simulates embodied task planning scenarios, so that GPT-3.5 produces instructions together with executable action plans. The generated instructions are diverse, including requests, commands, and queries, and only those with explicitly executable actions are added to the dataset. The object list used in the prompt is derived from the groundtruth labels of the instances present in the scene. The dataset is constructed as triplets (X_v, X_q, X_a), where X_v represents the visual scene, X_q the instruction, and X_a the executable action steps. The training phase uses the groundtruth object list, while the inference phase relies on an open-vocabulary object detector. The AI2-THOR simulator serves as the embodied environment, and the dataset is expanded by modifying the groundtruth object list to increase the scale and diversity of training samples for effective task planner finetuning. A total of 6400 training samples are generated, with 15K samples for training and 60 triplets for evaluation.
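The generation pipeline described above can be sketched as follows. This is a minimal illustration under stated assumptions: the prompt wording, the `Triplet` class, and the helpers `build_prompt` and `make_triplet` are hypothetical and do not reproduce the authors' actual prompt or code. The call to GPT-3.5 is deliberately left out, and the executability filter is a crude placeholder that only rejects empty plans.

```python
from dataclasses import dataclass


@dataclass
class Triplet:
    scene_objects: list  # X_v: the scene, represented by object class names
    instruction: str     # X_q: the human instruction
    action_steps: list   # X_a: the executable action steps


def build_prompt(object_names):
    """Build a text prompt from the object class names of a scene.

    The wording is a hypothetical stand-in for the designed prompt.
    """
    objects = ", ".join(sorted(set(object_names)))
    return (
        "You are an indoor service robot. "
        f"The scene contains: [{objects}]. "
        "Propose one instruction and a step-by-step action plan "
        "that interacts only with the listed objects."
    )


def make_triplet(object_names, instruction, action_steps):
    """Return a Triplet, or None when the response has no usable steps."""
    steps = [step.strip() for step in action_steps if step.strip()]
    if not steps:
        # Placeholder filter: keep only samples with at least one action step.
        return None
    return Triplet(sorted(set(object_names)), instruction, steps)
```

For example, `build_prompt(["Mug", "Sink", "Mug"])` lists each class name once, and `make_triplet(["Sink", "Mug"], "Wash the mug", [])` returns `None` because no action steps were produced. Expanding the dataset by modifying the object list would amount to calling `build_prompt` on altered copies of the same scene's list.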
4. How is the task plan grounded in the physical world?
To ground the embodied task plan in the physical world under feasibility constraints, the object list of the scene must be obtained accurately, without missing instances or false positives. An open-vocabulary object detector is generalized for object list acquisition. The agent collects RGB images at different locations to perceive the visual scene and discover the existing objects. The image collection strategies differ in their location selection criteria: traversal positions, random positions, the overall center point, and block-wise center points. Under every location selection criterion, the agent rotates the camera to obtain multi-view images. A grid side length hyperparameter divides the reachable area into grid cells. Clustering methods such as K-means divide the entire scene into sub-regions to improve perception performance, and the block-wise center point strategy traverses the centroids of these sub-regions to acquire sufficient visual information. The open-vocabulary object detector processes the collected multi-view RGB images to generate the object list, and duplicate object names are removed from the detection results. The task planning agent (TaPA) combines the perceived list of existing objects with the human instruction to generate executable action plans. The block-wise center point strategy is chosen for multi-view RGB image collection, with a grid size of 0.75 and a unit angle for camera rotation of 2π/3.
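The block-wise center point strategy can be sketched in a few steps: lay a grid over the reachable area, cluster the grid points with K-means, visit each cluster centroid, rotate the camera by the unit angle, and merge the per-image detections. The sketch below is an illustration under assumptions, not the authors' code: the rectangular area, the function names, and the plain K-means implementation are hypothetical, and the detector itself is not modeled. Only the grid size of 0.75 and the unit angle of 2π/3 come from the text.

```python
import math
import random


def grid_points(width, depth, grid_size=0.75):
    """Candidate positions on a square grid over a rectangular area (assumed shape)."""
    nx = int(width // grid_size) + 1
    nz = int(depth // grid_size) + 1
    return [(i * grid_size, j * grid_size) for i in range(nx) for j in range(nz)]


def kmeans(points, k, iterations=20, seed=0):
    """Plain K-means on 2D points; returns the k centroids (block-wise centers)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids


def camera_yaws(unit_angle=2 * math.pi / 3):
    """Yaw angles covering a full turn in steps of the unit angle."""
    count = round(2 * math.pi / unit_angle)
    return [i * unit_angle for i in range(count)]


def merge_detections(per_image_labels):
    """Union of detected object names across images, duplicates removed, order kept."""
    seen = []
    for labels in per_image_labels:
        for label in labels:
            if label not in seen:
                seen.append(label)
    return seen
```

With a unit angle of 2π/3, `camera_yaws()` yields three views per location, so a scene split into k sub-regions produces 3k images for the detector. In a real simulator a centroid may fall on a position the agent cannot stand on, in which case it would have to be snapped to the nearest reachable grid point; that step is omitted here.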