Do LLMs learn world models?

Image credit: Microsoft Bing

The short answer is: yes and no. They do learn a world model, just not the world model of the real world.

The title of this article involves two concepts: “large language models” and “world models.” In the current wave of AI, what a large language model is hardly needs explanation. Whether a large language model learns a “world model,” however, depends on how we define the term “world model.”

First, let’s define what a “world” is. A chessboard game contains a highly abstract two-dimensional world, a video game contains a 3D virtual world, the text of a novel constitutes a world, all the videos on the internet form a world, and the real world we live in is also a world. What counts as a “world model” differs in each of these worlds.

For example, in a chessboard game, the world model is the rules of the game, which could be described clearly in just a few pages. In a video game, the world model consists of the rules governing the transition of game states, or in simpler terms, the operational rules of that 3D virtual world. The world model of the real world consists of physical laws, and so on.

If we restrict the term “world model” in the question to the world model of the “world composed of all internet text,” then today’s large language models have certainly learned it to some extent. Why? Because this world model is precisely the set of statistical relationships between words and sentences in the realm of pure text, which is exactly what the next-token prediction objective of large language models is trained to capture.
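
To make that training objective concrete, here is a minimal sketch, assuming PyTorch and a deliberately toy “model” (a per-token embedding plus a linear head, so effectively a bigram predictor; real LLMs use deep transformers but optimize the same kind of loss). The token IDs are made up for illustration.

```python
import torch
import torch.nn as nn

# Toy stand-in for a language model: an embedding table plus a linear head.
# Real LLMs use deep transformers, but the loss below is the same idea.
vocab_size, dim = 100, 32
embed = nn.Embedding(vocab_size, dim)
lm_head = nn.Linear(dim, vocab_size)

# A made-up token sequence standing in for a piece of internet text.
tokens = torch.tensor([[5, 17, 42, 8, 99, 3]])   # shape: (batch=1, seq_len=6)

inputs  = tokens[:, :-1]   # tokens 1..t
targets = tokens[:, 1:]    # tokens 2..t+1 (the "next" token at each position)

logits = lm_head(embed(inputs))                  # (1, seq_len-1, vocab_size)

# Cross-entropy between the predicted next-token distribution and the actual next token.
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
print(loss.item())
```

Nothing in this objective refers to the physical world: the only supervision signal is which token tends to follow which, that is, the statistical structure of text itself.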

However, how far does knowledge of such a purely textual world model take a machine intelligence toward the world model of the real world? For the sake of discussion, let’s call the first, textual model “Model T” and the second, real-world model “Model R.”

Some people are very optimistic, even believing that Model T is no different from Model R; the two are indistinguishable. Since large language models can fluently answer all kinds of questions in the textual domain, especially those involving time and space, they argue that this is equivalent to having learned our world model, Model R. For example, I once asked ChatGPT questions similar to the following:

Question: “I have two balloons, one filled with hydrogen and the other filled with air. If I release both balloons simultaneously, which balloon will be higher after 10 seconds?” Answer: “The balloon filled with hydrogen will be higher after 10 seconds because hydrogen is lighter than air.”

Question: “There are two speakers in the room, each with a cup placed beside it. The water in the cup next to the first speaker has few ripples, but the water in the cup next to the second speaker constantly produces ripples. So, which speaker has the louder sound?” Answer: “The second speaker has the louder sound. The ripples in the water indicate that it is producing stronger sound waves, causing the water in the cup to vibrate more vigorously.”

Similar questions can be asked many times, and ChatGPT can generally provide correct answers. If a large language model’s capabilities are strong enough, and the question descriptions are detailed enough and free of ambiguity, then it can essentially “answer all things in the world,” including questions involving the physical space and time of the real world. Does this mean that it has learned Model R?

Before answering this question, I’d like to pose another: suppose a large language model has learned a world model C (the rules of chess). Would we consider that it has also learned Model R?

Many people might say no at this point. But let’s consider it from a different angle: if this language model can explain the rules of chess to me, and even create new chess strategies based on the existing rules and analyze them for me, does that count as answering real-world questions? Personally, I think it does: a brilliant move in a chess game may be comprehended by someone and then mapped to a real-world event, where it becomes the perfect answer to a physical or social question. So why does it seem like the chessboard model has not learned Model R while ChatGPT has?

Notice the term I used above: mapping. Essentially, when a person associates a rule of chess with an event in the real world, they are constructing, in their mind, a mapping from the chessboard game space to the real world. Since the chessboard game space is relatively small, such mappings are limited, giving the impression that there are few opportunities for the rules of chess to correspond to real-world events. (Someone with weaker comprehension might glean nothing about the real world from Model C at all.)

This sense of disconnection between Model C and Model R comes from the relatively small size of the chessboard game space. Many phenomena in the real world simply cannot be explained within a chess game, so it is unlikely that anyone would consider the chessboard model to have learned the “world model” in the usual sense.

So if there is another space whose concepts are so numerous and whose scope is so broad that it establishes densely interconnected relationships with our real world, a large model trained in this space would give people the illusion that it has learned all the laws of the real world, that is, that it has learned Model R. Unfortunately, or fortunately, the “space composed of all human history and internet texts” is exactly such a space.

Where does this illusion come from? It comes from the long evolution of human language, which is by now capable of describing essentially all objectively occurring events in the real world (this is, after all, why language was invented). So when we read a not-too-abstract piece of text, we unconsciously, almost effortlessly, associate its words and phrases with things and events in the real world. As a result, the sense of disconnect between Model T and Model R is greatly reduced, to the point that the two models are conflated. When a problem posed in the textual space is accurately answered by a large model, we feel as if we were experiencing the scene ourselves, as if the large model were pointing out the answer to us directly in the real world.

Furthermore, humans have always had a psychological tendency toward “anthropomorphism”: explaining the characteristics of animals or inanimate objects in terms of human abilities, behaviors, or experiences. We might think that because the human brain has learned Model R, and a large language model can provide answers and explanations to real-world problems, it has probably learned a similar Model R. Although this inclination may not sound rational, it still influences researchers of large models to some extent. A famous historical example of this tendency: when radios were first invented, many people believed there was a little person inside doing the talking. Anthropomorphism, to some extent, helps people understand a newly emerging technology through a familiar way of thinking.

So my conclusion is that, strictly speaking, all current large language models can only learn the world model of a “parallel world,” where “parallel” is relative to the real world. If we need to establish a connection between the two worlds, then we must either teach machines this connection explicitly (language <-> visuomotor), or have humans act as the intermediary that applies the “thoughts” of the large model to reality. Another solution is to pour more and more multimodal data into this parallel world so that it approximates the real world ever more closely. At that point, it might be the right time to start discussing whether large models have truly learned Model R.

Haonan Yu
Researcher & Engineer