The key to solving LLM hallucinations


The output of large language models can be divided into two categories. The first consists of metaphysical, philosophical ideas that current scientific methods can neither confirm nor falsify, such as the meaning of life, what extraterrestrial beings look like, or whether humans have souls. This type of output has little to do with “hallucinations”; one could even say that hallucination is a useful feature here, because these viewpoints cannot be checked against facts. The language model can sample freely in probability space and generate imaginative answers, some of which may even be insightful. The key issue for this type of output is alignment, and in particular alignment with whose perspectives. In my personal view, it is meaningless to discuss alignment for an individual language model; in the future there may be a family of large language models with different perspectives, all evolved from the same foundation model.

The second type of output describes events and objects and can be verified against the objective world around us as understood by science, for example that objects are composed of molecules and atoms, that animals can be male or female, or that human personalities can be categorized as introverted or extroverted. This type of output is the focus when addressing the hallucination problem in language models. Hallucinations are always relative to objective facts (“A hallucination is a perception in the absence of an external stimulus that has the qualities of a real perception” - Wikipedia). Therefore, the key to addressing hallucinations in a large language model lies in how to define such an objective world or environment.

The same statement may be a fact or a hallucination depending on the definition of the world. For example, “there are three suns in the universe” is a hallucination in our current world but a factual statement in the world of the Three-Body Problem.

Additionally, the defined world must be self-consistent. If we take all the information on X.com as one world, it is very likely not self-consistent, because conflicting or false reports may exist for the same event. A large model trained on information from X.com is therefore bound to hallucinate from the start.
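
To make “self-consistency of a world” concrete, here is a minimal sketch in Python. The `Fact` triples and the naive pairwise check are purely illustrative assumptions, not a proposal for how such a world would actually be represented; real-world text would require something like natural-language contradiction detection.

```python
# A minimal sketch: the "world" is a set of (subject, relation, value) facts,
# and it is inconsistent if two facts assign different values to the same
# (subject, relation) pair -- e.g. two contradictory reports of one event.
from collections import defaultdict
from typing import NamedTuple

class Fact(NamedTuple):
    subject: str
    relation: str
    value: str

def find_contradictions(facts: list[Fact]) -> list[tuple[Fact, Fact]]:
    """Return pairs of facts that disagree on the same (subject, relation)."""
    seen: dict[tuple[str, str], list[Fact]] = defaultdict(list)
    contradictions = []
    for fact in facts:
        key = (fact.subject, fact.relation)
        for earlier in seen[key]:
            if earlier.value != fact.value:
                contradictions.append((earlier, fact))
        seen[key].append(fact)
    return contradictions

# Two "reports" about the same event that cannot both be true.
world = [
    Fact("event_42", "casualty_count", "3"),
    Fact("event_42", "casualty_count", "30"),
    Fact("solar_system", "number_of_suns", "1"),
]
print(find_contradictions(world))   # the two casualty_count facts conflict
```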

Once the objective world is defined, we need to continuously provide feedback on the output of the large language model, either proving or disproving it. One assumption here is that hallucinations will always exist in theory, but an effective feedback mechanism can continuously correct them and adapt to the evolution of the world. Humans often experience hallucinations too, but mentally healthy individuals eventually wake up to reality and adjust their behavior and cognition accordingly. For example, a person may see a mirage in the desert, find nothing upon closer inspection, and conclude that it was a natural phenomenon formed by the refraction and total reflection of light. This kind of feedback from the world resembles the paradigm of reinforcement learning (RL), although the specific model optimization algorithm need not take the form of reinforcement learning.
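
Below is a minimal, self-contained sketch of this “generate, verify in the world, correct” loop. The toy model, the fact table standing in for the world, and the direct belief overwrite are all hypothetical simplifications; in particular, the update step is only a placeholder for whatever optimization method is actually used, RL or otherwise.

```python
# Toy "world": a fixed table of facts the model's answers are checked against.
WORLD_FACTS = {
    "number of suns in the solar system": "1",
    "states of water at standard pressure": "solid, liquid, gas",
}

class ToyModel:
    def __init__(self):
        # The model starts out with one hallucinated belief and one gap.
        self.beliefs = {"number of suns in the solar system": "3"}

    def generate(self, prompt):
        return self.beliefs.get(prompt, "unknown")

    def apply_feedback(self, prompt, correction):
        # Stand-in for whatever optimization algorithm actually updates the model.
        self.beliefs[prompt] = correction

def world_verify(prompt, answer):
    """Check an answer against the defined world; return (is_correct, correction)."""
    truth = WORLD_FACTS.get(prompt)
    return answer == truth, truth

model = ToyModel()
for prompt in WORLD_FACTS:
    answer = model.generate(prompt)
    correct, correction = world_verify(prompt, answer)
    if not correct and correction is not None:
        model.apply_feedback(prompt, correction)   # the hallucination gets corrected

print(model.generate("number of suns in the solar system"))  # "1" after feedback
```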

Learning from the feedback of the world is essentially a symbol grounding problem. I believe this is an unavoidable challenge for large language models on their way to solving hallucinations.

In terms of implementation, one feasible idea is to verify every answer a language model produces against the defined world in some way. This verification can take many forms; even high-level abstract facts can be checked cheaply. A description of a physical law (“the force of gravity on an object is directly proportional to its mass”), for instance, can be verified by consulting books rather than running experiments from scratch.
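
As a rough illustration of “verification by consulting books,” the sketch below checks a generated statement against a tiny reference corpus using crude keyword overlap. The corpus, the threshold, and the scoring function are all assumptions made for illustration; a real system would use retrieval plus an entailment or fact-checking model.

```python
# "Books" here are just a couple of reference sentences; in practice this
# would be a large indexed corpus of trusted text.
REFERENCE_CORPUS = [
    "The gravitational force on an object near Earth is proportional to its mass (W = m * g).",
    "Water is composed of hydrogen and oxygen atoms.",
]

def support_score(statement: str, passage: str) -> float:
    """Fraction of the statement's words that also appear in the passage."""
    s_words = set(statement.lower().split())
    p_words = set(passage.lower().split())
    return len(s_words & p_words) / max(len(s_words), 1)

def verify_against_books(statement: str, threshold: float = 0.5) -> bool:
    """Treat the statement as grounded if some passage overlaps with it enough."""
    return any(support_score(statement, p) >= threshold for p in REFERENCE_CORPUS)

print(verify_against_books("the gravitational force on an object is proportional to its mass"))  # True
print(verify_against_books("there are three suns in the solar system"))  # False
```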

This approach of “getting an answer and verifying it in the world” to combat hallucinations arises naturally in robotics. A robot’s policy might output incorrect actions in the belief that executing them will complete the task, but the real world quickly provides feedback or imposes fatal penalties. In this sense, robotics exposes the symbol grounding problem to everyone by default. The real world is harsh and unforgiving for robots; no machine can escape it, and even the slightest hint of “hallucination” is unacceptable. By comparison, the digital world is much more forgiving for large language models.

Reinforcement learning from human feedback (RLHF) currently has two obvious drawbacks: a) the feedback signal may be mixed with human subjectivity, and different people may give conflicting evaluations of the same answer; b) human feedback is too inefficient and is only a rough approximation of feedback from the world. Overall, the key to solving hallucinations lies in whether we can define a self-consistent world and implement a mechanism that automatically verifies and continuously corrects the output of language models in that world.

Haonan Yu
Researcher & Engineer