Do LLMs learn world models? (大语言模型学到了世界模型吗?)

Image credit: Microsoft Bing

The short answer: they do, and at the same time they don't.

The title of this article contains two concepts: "large language models" and "world models." In the current wave of AI, the first is presumably well understood by now. Whether a large language model has learned a "world model," however, depends on how we define that term.

First, let's define what a "world" is. A board game contains a highly abstract two-dimensional world; a video game contains a 3D virtual world; the text of a novel constitutes a world; all the videos on the internet together form a world; and the real world we live in is, of course, also a world. What "world model" means differs in each of these worlds.

For example, in a board game the world model is the set of rules of the game, which can be described clearly in just a few pages. In a video game, the world model is the set of rules governing transitions between game states, or, more plainly, the rules by which that 3D virtual world operates. The world model of the real world is the set of physical laws, and so on.
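To make the game-world sense of "world model" concrete, here is a minimal, purely illustrative Python sketch (my own toy example, not taken from any actual game): the world model is nothing more than a state-transition function that says how a state changes under an action.

```python
# A toy "world model" as a state-transition function: next_state = step(state, action).
# Hypothetical illustration only; not the rules of any particular game.
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    x: int  # the agent's position on a 1-D board with cells 0..4

def step(state: State, action: str) -> State:
    """The entire 'world model' of this toy world: how a state changes under an action."""
    if action == "right":
        return State(min(state.x + 1, 4))
    if action == "left":
        return State(max(state.x - 1, 0))
    return state  # any other action leaves the world unchanged

s = State(x=2)
print(step(step(s, "right"), "right"))  # State(x=4)
```

A board game's rulebook, a game engine's update step, and the laws of physics all play this same role, just at vastly different scales.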

If we restrict the "world model" in the question to the world model of the "world composed of all internet text," then today's large language models have certainly learned it to some extent. Why? Because that world model is precisely the statistical relationships among words and sentences in the purely textual domain, which is exactly what the next-token-prediction objective of large language models optimizes for.
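As a deliberately oversimplified illustration of what "statistical relationships between words" means, here is a sketch of a count-based bigram predictor; it is my own toy example, not anything the article measures. A real large language model replaces the count table with a neural network trained on the same kind of next-token objective, at vastly larger scale.

```python
# A minimal, count-based caricature of "Model T": statistics over which token
# follows which in a text corpus. Illustrative only.
from collections import Counter, defaultdict

corpus = "hydrogen is lighter than air so the hydrogen balloon rises higher".split()

# Count how often each token follows each preceding token (bigram statistics).
transitions = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev][nxt] += 1

def predict_next(token):
    """Return the most likely next token under the learned text statistics."""
    counts = transitions[token]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("lighter"))  # -> 'than'
```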

However, how much of the knowledge in such a purely textual world model would let a machine agent transfer smoothly to the world model of the real world? For the sake of discussion, let's call the first model "Model T" and the second, real-world model "Model R."

Some people are very optimistic, even believing that Model T is no different from Model R, that the two are indistinguishable: since large language models can fluently answer all kinds of textual questions, especially those involving time and space, that, the argument goes, is equivalent to having learned our world's model, Model R. For example, I once asked ChatGPT questions similar to the following:

Question: “I have two balloons, one filled with hydrogen and the other filled with air. If I release both balloons simultaneously, which balloon will be higher after 10 seconds?” Answer: “The balloon filled with hydrogen will be higher after 10 seconds because hydrogen is lighter than air.”

Question: “There are two speakers in the room, each with a cup placed beside it. The water in the cup next to the first speaker has few ripples, but the water in the cup next to the second speaker constantly produces ripples. So, which speaker has the louder sound?” Answer: “The second speaker has the louder sound. The ripples in the water indicate that it is producing stronger sound waves, causing the water in the cup to vibrate more vigorously.”

Many similar questions can be asked, and ChatGPT generally gives correct answers. If a large language model is capable enough, and the questions are described in enough detail and without ambiguity, then it can essentially "answer all things in the world," including questions involving the physical space and time of the real world. Does this mean it has learned Model R?

Before answering that, I'd like to pose another question: suppose a large language model has learned the world model of a board game, Model C (the rules of chess). Would we then consider it to have learned Model R as well?

Many people would probably say no. But consider it from a different angle: if this language model can explain the rules of chess to me, and can even invent new playing strategies from the existing rules and analyze them for me, does that count as answering real-world questions? Personally, I think it does: a brilliant move in a chess game may be grasped by someone and then mapped onto a real-world event, becoming the perfect answer to a physical or social question. So why does it feel like the chess model has not learned Model R while ChatGPT has?

Notice the word I used above: mapping. Essentially, when a person associates a rule of chess with an event in the real world, they are performing, in their mind, a mapping from the chess space to the real world. Because the chess space is comparatively tiny, such mappings are limited, giving the impression that the rules of chess rarely get the chance to correspond to real-world events. (Someone whose comprehension is a bit weaker might glean nothing about the real world from Model C at all.)

This sense of disconnect between Model C and Model R comes from the small size of the chess space. It guarantees that many phenomena in the real world can never be explained within the game, so hardly anyone would consider the chess model to have learned a "world model" in the usual sense.

So if there were another space whose concepts are so numerous and whose scope is so broad that it forms dense, almost ubiquitous connections with our real world, a large model trained in that space would give people the illusion that it has learned all the laws of the real world, that is, that it has learned Model R. Unfortunately, or fortunately, the "space composed of all human history / internet text" is exactly such a space.

Where does this illusion come from? It comes from the fact that, over its long evolution, human language has become capable of describing essentially every event that objectively occurs in the real world (which is also why language was invented in the first place). So when we read a not-too-abstract piece of text, we unconsciously, almost effortlessly, associate its words and phrases with things and events in the real world. In this regard, the sense of disconnect between Model T and Model R shrinks dramatically, and the two models are even conflated. When a question in the textual space is answered accurately by a large model, we "feel as if we were there," as if the model were pointing out the answer to us directly in the real world.

On top of that, humans have a long-standing psychological tendency toward "anthropomorphism": explaining the characteristics of animals or inanimate objects in terms of human abilities, behaviors, or experiences. Because our own brains have learned Model R, and a large language model can answer and explain real-world questions, we conclude that it has probably learned a similar Model R too. This inference may not sound rational, but it still influences, to some degree, how researchers study large models. A famous historical example of the tendency: when radios were first invented, many people mistakenly believed there was a little person inside doing the talking. Anthropomorphism can, to some extent, help people understand a newly emerging technology through a familiar way of thinking.

So my conclusion is that, strictly speaking, all current large language models can only learn the world model of a "parallel world," where "parallel" is relative to the real world. If we need to establish a connection between the two worlds, we must either find ways to teach machines that connection (language <-> visuomotor), or have humans act as the intermediary, carrying the "thoughts" of the large model into reality. Another route is to pour more and more multimodal data into this parallel world so that it approaches the real world arbitrarily closely. At that point, it might finally be the right time to discuss whether large models have truly learned Model R.

Haonan Yu
Researcher & Engineer