What is language grounding and why is it needed for an AI agent?
Quite a few people have asked me what “language grounding” means. So I think I’ll write a short post explaining it, and specifically arguing why it is important to a truly intelligent agent.
Grounding is a computational process of associating symbols with concrete concepts in the current context. It is a concretization process, the reverse of abstracting environmental elements into symbols. For efficient communication, humans invent low-bandwidth symbols/languages to summarize the things and concepts in their surrounding environments. Thus, to really understand those tokens, reversing this kind of abstraction (grounding) is necessary. Grounding makes it possible to share sensory experiences across individuals via language. Without grounding, conversation becomes pure symbol manipulation.
Practically, why is grounding important for designing an AI agent? Suppose a human user sends the command “Fetch me an apple” to a robot assistant at home. The robot might have lots of hypotheses about how this task could be achieved by querying its internal common-sense database (e.g., an LLM trained on large-scale internet text data). It soon finds different options such as “Find the apple in a basket that is in the kitchen”, “Buy an apple at a grocery store”, “Pick an apple from the apple tree in the backyard”, “Get an apple from the fridge”, and so on. If the robot’s language understanding is ungrounded, it can either try these options (and there could be many of them!) one by one, or randomly pick one to execute. This will lead to failure in most cases.
On the other hand, if the robot is able to ground the language command, say, by relating “Fetch an apple” to the current home environment, it observes no fruit basket, no apple tree in the backyard, and no grocery store nearby, but a fridge in the kitchen. Based on these observations, it can choose the most likely plan to execute the command. (This grounding ability is also what SayCan [1] explores.) Moreover, to successfully execute the task “Take out an apple from the fridge”, the robot needs to locate the fridge, open the fridge door, locate the apple, take it out, and bring it to the human user. This sequence of actions requires a correct mapping from the word tokens “take”, “apple”, and “fridge” to action decisions, which is yet another, lower-level grounding problem.
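The plan-selection step above can be sketched as filtering candidate plans by whether their preconditions are satisfied in the observed environment. This is a minimal toy sketch; the plan list, the observed-object set, and the precondition representation are all illustrative assumptions, not any real robot stack.

```python
# Hypothetical sketch: choosing among LLM-proposed plans by grounding each
# plan's preconditions in the robot's current observations.

OBSERVED = {"fridge", "kitchen", "table"}  # objects the robot currently perceives

# Each candidate plan, paired with the environment elements it requires.
PLANS = [
    ("find apple in fruit basket", {"fruit_basket"}),
    ("buy apple at grocery store", {"grocery_store"}),
    ("pick apple from backyard tree", {"apple_tree"}),
    ("get apple from fridge", {"fridge"}),
]

def ground_plan(plans, observed):
    """Return the first plan whose required elements are all observed."""
    feasible = [plan for plan, needs in plans if needs <= observed]
    return feasible[0] if feasible else None

print(ground_plan(PLANS, OBSERVED))  # -> get apple from fridge
```

A real system would score feasibility with learned affordance estimates rather than hard set inclusion, which is roughly the direction SayCan takes.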
Of course, the human user can be more explicit about the apple-fetching task by describing the scenario in detail: “Our home has a fridge in the kitchen, which is located on the right side of the house. There are some apples inside the fridge. Go grab an apple for me.” This resolves most of the ambiguity in the shorter command “Fetch me an apple”, but it requires a tedious elaboration every time the user asks the robot to do the same thing.
Still, some people might question whether an explicit grounding procedure is needed at all. From the perspective of computer vision, suppose we have already trained a versatile object detector that can detect thousands of object classes, covering almost all daily objects. Can we just use the detected labels to inform the robot’s actions? For example, if the detector has produced an apple label in the scene image and the command is “Find an apple”, can we directly send the robot to that object and call it done? The answer is yes in this case. However, the grounding problem can be far more complex than this simple case.
When we assign labels to the object classes of a detector, we are actually using a symbol space that might differ from our language space. Let’s say we define 1000 labels, including “sedan”, “suv”, and “car”, where “car” represents all vehicles that are neither a sedan nor an SUV. Now, given a scene where an SUV is available and the command “Fetch my wallet from the car”, if we strictly follow the detector’s label definitions, the robot doesn’t know how to achieve the task because there is no “car” available! The problem is that class labels like “sedan”, “suv”, and “car” are just symbols; they could equally well be “A”, “B”, and “C”. Because human language can be highly ambiguous and context dependent, a fixed mapping from detector labels to language tokens will not work. In the example, the word “car” is flexible, and given the scene it clearly refers to the SUV. Thus, we need a dynamic, context-dependent mapping between language tokens and detector labels. This ability to map dynamically is, in fact, language grounding.
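The SUV example can be made concrete with a tiny resolver that maps a language token to whichever candidate detector label is actually present in the scene. The token-to-label table and the scene contents here are illustrative assumptions for exposition only.

```python
# Hypothetical sketch of a context-dependent mapping from a language token
# to detector labels. A token may correspond to several detector classes.

TOKEN_TO_LABELS = {
    "car": ["sedan", "suv", "car"],  # "car" in language covers all three classes
    "apple": ["apple"],
}

def ground_token(token, detected_labels):
    """Resolve a language token to whichever candidate label is in the scene."""
    for label in TOKEN_TO_LABELS.get(token, [token]):
        if label in detected_labels:
            return label
    return None

# The scene contains only an SUV, so "car" dynamically resolves to "suv".
print(ground_token("car", {"suv", "fridge"}))  # -> suv
```

A static lookup table like this is still too rigid for real language; the point is only that the token-to-label resolution must depend on what the current scene contains.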
The requirement for grounding is even more obvious when it comes to abstract words, such as spatial-relation prepositions. The meaning of “on top of” can be totally different in different contexts, and there is no way a robot can execute a command containing “on top of” without grounding it in the environment. For example, an object “on top of” a table can have a very different absolute Z coordinate depending on how high the table is, so the robot arm will have different target poses for the same command.
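The "on top of" example reduces to simple geometry once the surface is grounded: the same phrase yields different target Z coordinates for different tables. The heights and clearance below are made-up illustrative values.

```python
# Minimal sketch: grounding "on top of <surface>" to a target Z coordinate
# for the robot arm. All numbers are illustrative assumptions.

def target_z_on_top_of(surface_height_m, object_half_height_m, clearance_m=0.02):
    """Place the object's center just above the grounded surface."""
    return surface_height_m + object_half_height_m + clearance_m

# The same command grounds to different target poses on different tables.
print(target_z_on_top_of(0.75, 0.05))  # dining table
print(target_z_on_top_of(0.45, 0.05))  # low coffee table
```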
So how should we solve the grounding problem? Can we just collect a huge dataset of (language, environment_context, robot_trajectory) triplets and use supervised training with a general model architecture, hoping that the grounding ability emerges from learning the mapping (language, environment_context) -> robot_trajectory? In theory this approach is possible, but in practice the required dataset would be too large to be feasible. Others believe in grounding individual words first and then composing them according to language syntax to ground whole sentences. However, this is only a rough idea so far, without a concrete execution plan.
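To make the supervised formulation concrete, one triplet record might look like the sketch below. The field names, context encoding, and action vocabulary are all hypothetical; the point is only the shape of the data a model would have to fit.

```python
# Illustrative data schema for the supervised grounding formulation:
# learn (language, environment_context) -> robot_trajectory from triplets.
from dataclasses import dataclass, field

@dataclass
class GroundingExample:
    language: str                 # natural-language command
    environment_context: dict     # e.g., detected objects, robot state
    robot_trajectory: list        # target sequence of low-level actions

dataset = [
    GroundingExample(
        language="fetch me an apple",
        environment_context={"objects": ["fridge", "table"]},
        robot_trajectory=["goto fridge", "open door", "grasp apple", "return"],
    ),
    # ...the combinatorics of commands x environments is what makes the
    # required dataset size infeasible in practice.
]
```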
Since its launch, ChatGPT [2] has drawn lots of attention. However, because it was not explicitly trained to have grounding ability, the output of ChatGPT ultimately has to be interpreted by humans. For example, when a user types a general request such as “Help me book a ticket for traveling to San Jose on Aug 1st 2023”, it can’t yet reliably go to an airline website and place the order. Thus, the command-execution loop has to be closed by real humans for now, leaving a huge gap between GPT and real AGI. Fortunately, there have been some efforts toward this problem, for example, ACT-1 [3] and ChatGPT plugins [4]. In general, it might be possible to equip LLMs such as ChatGPT with some kind of grounding ability in a constrained virtual world where the action/observation space is also highly abstract and restricted. But to make these LLMs interact with the real physical world and actually free us from repetitive labor, we still have a long way to go toward solving the grounding problem.
[1] Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, Ahn et al., CoRL 2022.
[2] ChatGPT, OpenAI. https://openai.com/blog/chatgpt/
[3] ACT-1, Adept. https://www.adept.ai/act
[4] ChatGPT plugins, OpenAI. https://openai.com/blog/chatgpt-plugins