The key to solving LLM hallucinations (解决大语言模型幻觉的关键)

Image credit: Microsoft Bing
  1. The output of large language models can be divided into two categories. The first is metaphysical, philosophical speculation that current science can neither confirm nor falsify: the meaning of life, what extraterrestrials look like, whether the human soul exists, and so on. This type of output has little to do with “hallucinations”; one could even say that hallucination itself is a useful feature here, because these viewpoints cannot be checked against facts, so a language model is free to sample across its probability space and produce imaginative answers, some of which may even be insightful. This type of output can only be aligned, and the key question is whose views to align with. Personally, I believe that discussing alignment for a single language model is meaningless; in the future there may be a family of large language models with different worldviews, all evolved from the same foundation model.

  2. The second type of output describes events and objects and can be verified by science developed from the objective world around us: matter is composed of molecules and atoms, animals are male or female, human personalities range from introverted to extroverted, and so on. This type of output is the focus when addressing hallucinations in language models. A hallucination is always relative to objectively existing facts (“A hallucination is a perception in the absence of an external stimulus that has the qualities of a real perception” - Wikipedia). Therefore, the key to addressing hallucinations in a large language model lies in how to define such an objective world or environment.

  3. The same statement may be a fact or a hallucination depending on how the world is defined. For example, “there are three suns in the universe” is a hallucination in our current world but a factual statement in the world of The Three-Body Problem.
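The point above can be made concrete with a toy sketch: the same claim is checked against two different “worlds,” each defined as a small set of ground-truth facts. The world names and fact keys here are illustrative, not part of any real system.

```python
# A minimal sketch: the truth of a statement is relative to which "world"
# (set of facts) it is verified against. World contents are illustrative.

WORLDS = {
    "our_universe": {"suns_in_solar_system": 1},
    "three_body":   {"suns_in_solar_system": 3},  # the Trisolaran system
}

def is_hallucination(claim_key, claim_value, world_name):
    """A claim is a hallucination relative to a world, not in the absolute."""
    world = WORLDS[world_name]
    return world.get(claim_key) != claim_value

# "There are three suns" is a hallucination in one world, a fact in the other.
print(is_hallucination("suns_in_solar_system", 3, "our_universe"))  # True
print(is_hallucination("suns_in_solar_system", 3, "three_body"))    # False
```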

  4. Additionally, the defined world must be self-consistent. If we treat all the information on X.com as one world, it is very likely not self-consistent, because fake news may coexist with truthful reports of the same event. Therefore, hallucinations in large models trained on X.com data are inevitable from the start.

  5. Once the objective world is defined, it must continuously provide feedback on the output of the large language model, either confirming or refuting it. The underlying assumption is that hallucinations always exist in theory, but an effective feedback mechanism can continuously correct them and adapt to the evolution of the world. Humans hallucinate too, but mentally healthy individuals eventually sober up against reality and adjust their behavior and cognition accordingly: seeing a mirage in the desert, finding nothing upon approaching it, and then concluding that it is a “natural phenomenon formed by the refraction and total internal reflection of light.” This feedback from the world resembles the paradigm of reinforcement learning (RL), though the actual model-optimization algorithm need not take the form of RL.

  6. Learning from the world’s feedback is essentially a symbol grounding problem. I believe this is a challenge that large language models ultimately cannot avoid on the way to solving hallucinations.

  7. In terms of implementation, one feasible idea is to verify each answer a language model produces against the defined world in some way. This verification can take many forms; even high-level abstract facts can be checked cheaply. For example, a statement of a physical law (“the force of gravity on an object is directly proportional to its mass”) can be verified by consulting books rather than by redoing experiments from scratch.
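The “consult books rather than redo experiments” idea can be sketched as a lookup against a trusted reference corpus. The corpus and the substring-matching rule below are crude stand-ins for a real retrieval and entailment system, used only to illustrate the shape of the verification step.

```python
# Hypothetical sketch: verify a model's factual claim by looking it up in a
# trusted reference corpus instead of re-running experiments. The corpus and
# matching rule are toy stand-ins, not a real retrieval pipeline.

REFERENCE_CORPUS = [
    "the force of gravity on an object is directly proportional to its mass",
    "matter is composed of molecules and atoms",
]

def verify_against_corpus(claim: str) -> bool:
    """Return True if the claim is supported by the reference corpus."""
    normalized = claim.lower().strip()
    return any(normalized in entry or entry in normalized
               for entry in REFERENCE_CORPUS)

print(verify_against_corpus(
    "The force of gravity on an object is directly proportional to its mass"))  # True
print(verify_against_corpus("objects fall upward"))  # False
```

In practice the matching rule would be replaced by retrieval plus an entailment check, but the interface stays the same: a claim goes in, a verdict from the defined world comes out.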

  8. This “get an answer, then verify it in the world” approach to hallucinations arises naturally in robotics. A robot’s policy may output incorrect actions, mistakenly assuming that executing them will complete a task, but the real world quickly pushes back or imposes fatal penalties. In that sense, robotics research inherently lays the symbol grounding problem bare for everyone. The real world is harsh and unforgiving toward robots; no machine can escape it, and even a hint of “hallucination” is unacceptable. By comparison, the world is far more forgiving toward large language models.

  9. Reinforcement learning from human feedback (RLHF) currently has two obvious drawbacks: a) the feedback signal may be contaminated by human subjectivity, and different people may give conflicting evaluations of the same answer; b) human feedback is too inefficient, and is only a rough approximation of feedback from the world. Overall, the key to solving hallucinations lies in whether we can define a self-consistent world and implement a mechanism that automatically verifies the outputs of language models in this world and continuously corrects them.


Haonan Yu
Researcher & Engineer