Tag Map: A Text-Based Map
for Spatial Reasoning and Navigation
with Large Language Models

Robotic Systems Lab, ETH Zurich

Tag Map is a text-based scene representation that lets robots
generate navigation plans grounded in the scene context.

Abstract

Large Language Models (LLMs) have emerged as a tool for robots to generate task plans using common sense reasoning. For the LLM to generate actionable plans, scene context must be provided, often through a map. Recent works have shifted from explicit maps with fixed semantic classes to implicit open vocabulary maps based on queryable embeddings capable of representing any semantic class. However, embeddings cannot directly report the scene context as they are implicit, requiring further processing for LLM integration. To address this, we propose an explicit text-based map, built upon large-scale image recognition models, that can represent thousands of semantic classes while integrating easily with LLMs thanks to its text-based nature. We study how entities in our map can be localized and show through evaluations that our text-based map localizations perform comparably to those from open vocabulary maps while using two to four orders of magnitude less memory. Real-robot experiments demonstrate the grounding of an LLM with the text-based map to solve user tasks.


Explanatory Video

Method Overview



A Tag Map is built from a set of scene viewpoints given as RGB-D frames along with their poses. Each viewpoint is processed through a large-scale image recognition model to generate a set of text tags for the entities recognized from that viewpoint. The mapping from the unique tags recognized in the scene to their corresponding viewpoint poses is stored as the Tag Map scene representation. Despite its information sparsity, we show that we can recover coarse-grained 3D localizations of the tagged entities from such maps. While these localizations generally do not precisely bound an entity, we find that they are sufficient navigation goals for reaching that entity.
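The core data structure described above can be sketched as a simple mapping from tags to the viewpoint poses they were recognized from. The sketch below is illustrative only: `tag_image` is a hypothetical stand-in for the large-scale image recognition model used in the paper, stubbed here with canned tags so the example runs standalone.

```python
from collections import defaultdict

def tag_image(rgb_frame):
    # Hypothetical stand-in for a large-scale image tagging model;
    # a real implementation would run the frame through the model
    # and return the recognized entity tags.
    return rgb_frame["mock_tags"]

def build_tag_map(viewpoints):
    """Map each unique tag to the list of viewpoint poses it was seen from."""
    tag_map = defaultdict(list)
    for vp in viewpoints:
        for tag in tag_image(vp["rgb"]):
            tag_map[tag].append(vp["pose"])
    return dict(tag_map)

# Two mock viewpoints with (x, y, z) poses; real input would be
# posed RGB-D frames from a scene scan.
viewpoints = [
    {"rgb": {"mock_tags": ["chair", "table"]}, "pose": (0.0, 0.0, 0.0)},
    {"rgb": {"mock_tags": ["chair", "plant"]}, "pose": (1.5, 0.0, 0.2)},
]
tag_map = build_tag_map(viewpoints)
```

Note that only tags and poses are stored, not dense geometry or embeddings, which is why the representation is so memory-light; coarse 3D localizations are recovered later by intersecting the viewing regions of the poses associated with a tag.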

Tag Maps can easily be used to ground a Large Language Model (LLM) in the scene context by directly adding the tags to the LLM prompt. Further integration with the LLM can be done using the LLM's function calling capabilities, allowing it to query for additional information from the Tag Map. Through a chat interface, users provide task descriptions to the LLM, which then generates navigation plans towards completing the tasks.
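The two integration paths above can be illustrated with a minimal sketch: the scene's tags are listed directly in the prompt, and a tool function exposes viewpoint lookups for function calling. Function names and the JSON shape are assumptions for illustration, not the paper's actual interface.

```python
import json

def make_scene_prompt(tag_map):
    """Ground the LLM by listing the scene's unique tags in its prompt."""
    tags = ", ".join(sorted(tag_map))
    return f"You are a robot in a scene containing: {tags}."

def get_viewpoints_tool(tag_map, tag):
    # Hypothetical tool the LLM could invoke via function calling to
    # fetch the viewpoint poses associated with a tag; returns JSON,
    # a common format for tool results.
    return json.dumps({"tag": tag, "poses": tag_map.get(tag, [])})

# Mock Tag Map: tag -> list of (x, y, z) viewpoint poses.
tag_map = {"sofa": [[0.0, 1.0, 0.0]], "sink": [[3.0, -1.0, 0.1]]}
prompt = make_scene_prompt(tag_map)
tool_result = get_viewpoints_tool(tag_map, "sink")
```

In a full system, the prompt would be sent alongside the user's task description, and the poses returned by the tool call would serve as coarse navigation goals for the plan the LLM generates.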

Example Tag Map Coarse Localizations

Robot Experiments on Grounded Navigation

BibTeX

@article{zhang2024tagmap,
  author  = {Zhang, Mike and Qu, Kaixian and Patil, Vaishakh and Cadena, Cesar and Hutter, Marco},
  title   = {Tag Map: A Text-Based Map for Spatial Reasoning and Navigation with Large Language Models},
  journal = {Conference on Robot Learning (CoRL)},
  year    = {2024},
}