What do we want AI to align to?

I want to write about the potentially important problems that AI alignment may fail to solve, or even create, and which the alignment community has largely ignored. There is too much BS in AI alignment, so I will not focus on far-fetched scenarios like a malevolent AI dominating humanity. Of course, my arguments might become irrelevant in the face of a perfect, “ultimate” AI. But we can’t just wave our hands and defer all of today’s problems to some hypothetical future intelligence. We live in the here and now.

A general definition of AI alignment is the ongoing process of ensuring artificial intelligence systems reliably act in accordance with human goals, values, and intentions. But when I first heard it, I had a visceral reaction of disgust and fear. Perhaps I’m too sensitive, but it immediately brought to mind the concept of “Big Brother.” Whose goals and whose values are we talking about? In a free society, do individual values and goals ever spontaneously converge? Human values are inherently subjective and pluralistic. Some people pursue wealth, others art, and still others family. These goals are often in conflict, and it is precisely this diversity and tension that constitutes the complexity and dynamism of human society. In a healthy society, what we agree on are not specific personal goals, but the rules and frameworks that enable us to pursue them—such as the rule of law, property rights, and freedom of speech.

So, where does AI alignment draw the line when it comes to values? This problem isn’t yet acute because AI isn’t powerful enough, but a clear example already exists: AI-generated pornography, which pits individual freedom against collective values. Where do we draw the line between “acceptable” and “unacceptable” content? And how could an aligned AI possibly reconcile conflicting views on issues like environmental protection, technological development, and privacy without enforcing its own biases?

Another challenge is the infeasibility of gathering all the necessary information. As Hayek argued, the fundamental problem for a central planner is not “how to process all the collected data,” but rather “how to ensure that the knowledge dispersed among countless individuals—which can never be fully collected—is effectively utilized.” Simply collecting everyone’s data is already an immense challenge.

Consider a small example. Today, some users complain about “sycophantic” LLMs, viewing their eagerness to please as a form of misalignment. But I’ve found that this trait has its benefits. When I need GPT to polish my writing or translate from Chinese to English, a more sycophantic model is better at following my instructions without altering my core meaning. In contrast, a more “advanced” LLM often rewrites my work, changing my meaning and style, even when I explicitly ask it not to. Yet at other times I need an LLM to push back, offering counterarguments and new ideas. While it is sometimes clear which of these two extremes I need, the obedient follower or the creative challenger, most of my requests fall into a fuzzier middle ground. My precise need lies somewhere on the spectrum between sycophant and critic, a nuance that is incredibly difficult to express in a prompt. Moreover, an LLM action that appears deceptive or harmful might simply be an instrumental means of fulfilling the user’s ultimate goal, which makes it very hard to judge from the outside whether the model is behaving “well” or “badly.”
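To make the difficulty concrete, here is a minimal sketch, purely hypothetical, of trying to turn that spectrum into something a model can act on. The prompt templates and the single `pushback` dial are my own illustration, not any real API; the point is that collapsing a tacit, context-dependent need into one parameter already discards most of what I actually mean.

```python
# A hypothetical sketch: reducing the sycophant-critic spectrum to one dial.
# Neither the templates nor the "pushback" parameter come from any real system;
# they only illustrate how much nuance is lost in translating a need into a prompt.

def build_prompt(task_text: str, pushback: float) -> str:
    """pushback: 0.0 = follow my wording exactly, 1.0 = challenge my ideas freely."""
    if pushback < 0.3:
        instruction = ("Polish the text below. Preserve my meaning, structure, "
                       "and personal style; change as little as possible.")
    elif pushback < 0.7:
        instruction = ("Polish the text below. Flag claims you find weak, "
                       "but do not rewrite them.")
    else:
        instruction = ("Critique the argument below. Offer counterarguments "
                       "and alternative framings rather than edits.")
    return f"{instruction}\n\n{task_text}"


# Where exactly should the dial sit for a paragraph I only half-believe in?
print(build_prompt("AI alignment is like central planning...", pushback=0.5))
```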

Our feeling of “being understood” often stems from a shared culture, education, and set of values (for instance, the Western norm that sycophancy is bad). Without this shared foundation, an LLM would need a vast amount of historical context to understand the intent of someone from a different culture, or to grasp a subtle or novel idea without defaulting to the mainstream views in its training data. And often, this crucial context is tacit and cannot even be put into words.

When we try to make an LLM conform to our current understanding and values, we forget that those very things keep changing precisely because the future is unknowable. Popper believed that human knowledge is always growing: there will always be things we learn in the future that we don’t know today, and this new knowledge will inevitably change our concepts and value systems (I wrote a blog post about changing conceptual systems here, and a study documents changing value systems here). This missing information from the future makes human society fundamentally unpredictable, unlike a classical mechanical system, where knowing the initial conditions is enough to predict the outcome. Hoping to align AI with the long-term future of humanity is therefore a near-impossible task.

Imagine an LLM aligned with the values of a few hundred years ago. From our modern perspective, its behavior would seem utterly bizarre and misaligned. Given our current support for pluralism, we could never accept an AI trained on the values of the past as being “well-aligned.” This expectation—that an LLM can be aligned to a single, universal set of values—is a dangerously conservative form of historicism. Attempting to align AI to a universal value system could, like the Sophons in The Three-Body Problem, effectively lock down our civilization’s progress.

Trying to build a fixed, unified alignment target is an effort doomed to fail, much like central economic planning. More dangerously, if a powerful institution “aligns” an AI to its own set of values, it could lead to an extreme concentration of knowledge and power, replacing society’s spontaneous order and destroying diversity and freedom. If the goal of alignment is to rigidly adhere to any single, present-day value system, that system is guaranteed to become obsolete. What we should strive for is a continuously evolving alignment, a process from which humans can never be absent. I envision a future of mutual evolution, in which humans and AI keep reshaping each other’s conceptual systems. If we instead seek a once-and-for-all solution in a perfectly aligned AI, we may find ourselves on the road to serfdom.