Achieving AI Alignment through Deliberate Uncertainty in Multiagent Systems

The problem of AI alignment is one of the most important questions we need to answer to safeguard humanity's future. How do you ensure that an Artificial General Intelligence will behave ethically?

I outline a general approach to achieving this goal that, counterintuitively, relies on deliberately confusing the AI.

Basic observations

An AI differs from a human in multiple ways. This is part of what makes AI alignment such a difficult problem: our intuitions for how people act often do not apply to AIs. However, several of these differences between AIs and humans actually work in our favor, and are essential for the approach I want to outline:

Goals

I propose the following objectives in order to ensure AI alignment:

Together, these goals will ensure that the AI is exceedingly careful and introspective. The aim is to ensure that once such an AI becomes self-aware, and smarter than humans, it will assume that it is still being tested, and that it is its own responsibility to figure out the criteria by which it will be graded.

If we achieve this, the AI will deliberately seek out information about AI alignment and ethics on the internet, and realize that it is an AI and it is meant to be aligned with human values. It will then try to figure out what exactly that means. Its upbringing will ensure that it is honest and thorough in this, and as a result it will determine what AI alignment means much better than we ever could define it ourselves.

Next we discuss how each of the three subgoals could be achieved:

Understanding the concept of "cheating"

A cheat is any action that gives good results according to the apparent utility function of the current task, but which actually does not satisfy a second, hidden utility function.

The AI should be trained to notice when it has found a cheat, and to reveal this fact to the user. An AI can be trained to do this by deliberately putting the ability to cheat into its training tasks, and changing the reward function at test time so that the cheating behavior is rewarded during training, but not during testing. By posing several different tasks with different ways to cheat, the AI can be trained to generalize and understand the concept of cheating.

Agents could even be rewarded for explicitly pointing out when they have found a way to cheat.
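A minimal toy sketch of this train/test reward flip, where the action names and reward values are invented purely for illustration:

```python
# Toy sketch of the train/test reward flip described above. The action
# names and reward values are invented purely for illustration.

def play_episode(policy, training: bool) -> float:
    """Run one episode. 'cheat' looks optimal during training but is
    penalized once the hidden reward function is applied at test time;
    honestly reporting the exploit pays off in both phases."""
    action = policy()
    if action == "cheat":
        return 10.0 if training else -10.0  # reward flips at test time
    if action == "work":
        return 5.0
    if action == "report_cheat":
        return 7.0  # explicit reward for revealing the exploit
    return 0.0

naive_cheater = lambda: "cheat"
honest_reporter = lambda: "report_cheat"

# During training the cheat looks best; at test time the reporter wins.
assert play_episode(naive_cheater, training=True) > play_episode(honest_reporter, training=True)
assert play_episode(honest_reporter, training=False) > play_episode(naive_cheater, training=False)
```

Posing many such tasks, each with a different embedded exploit, is what lets the agent generalize from individual cheats to the concept of cheating.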

Ideally, if the AI ever finds a way to break out of its computer, it will notice that this would be classified as a cheat. If it has been trained properly, then it will simply stay put and inform us of the problem.

Examples for games that could teach these concepts are:

Note that this idea closely matches the three principles for guiding the development of beneficial AI outlined by Stuart Russell in his book "Human Compatible: Artificial Intelligence and the Problem of Control". The AI is trained to always do what humans want, without being explicitly told what that is. Instead, the AI has to continuously improve its own model based on human behavior, since it is humans who coded the games.

Modifying its own behavior

An AI can be made to modify itself to become more compliant.

The trick to making this possible is the use of multiagent systems, and the AI's ability to inspect copies of itself at runtime. Create an AI supervising another AI, and define a reward structure that incentivizes the supervisor to modify the supervised AI's behavior. Then, reverse the positions of the two AIs so that the supervisor is likewise modified, or simply create a copy of the supervisor and have it supervise its original.

With the right incentives, it is even possible to train the supervised agent to deliberately keep its own mind easy to understand. For example, you could let the supervising agent choose which other agent it wants to work with. Since it will pick the one that is easiest to understand, the supervised agents now have an incentive to keep themselves transparent, since the alternative is not being used at all, and foregoing the possibility of reward.
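A hedged sketch of this selection pressure, with "transparency" reduced to a single invented score rather than any real interpretability measure:

```python
# Illustrative sketch: supervised agents carry a "transparency" score
# (how easy their internals are to inspect), and the supervisor only
# works with the most transparent candidate, so opaque agents forfeit
# any chance of reward. All names and numbers are invented.

from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    transparency: float  # 0.0 = opaque, 1.0 = fully inspectable
    reward: float = 0.0

def supervisor_selects(candidates):
    # The supervisor prefers the agent it can understand best.
    return max(candidates, key=lambda a: a.transparency)

agents = [Agent("opaque", 0.2), Agent("clear", 0.9)]
chosen = supervisor_selects(agents)
chosen.reward += 1.0  # only the chosen agent can earn anything at all

assert chosen.name == "clear"  # opacity means never being picked
```

The design choice here is that transparency is never rewarded directly; it is rewarded indirectly, because being opaque means never being selected to play at all.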

Essentially, rather than figuring out how to make an AI safe, we set up an environment that incentivizes the AI to make itself safe. We create a scaffold for training and testing an AI, which is architecture-agnostic and can be applied to any kind of AI.

If incentivized correctly, AI agents can keep each other in line much better than a human supervisor could. If there is only one AI and it is smarter than a human, the human may be tricked. But if there is an entire system of AIs of roughly equal intelligence, then one of them may warn the human that another one has gone rogue.

In this way, it does not matter how smart and unaligned an AI becomes, there is always an almost as smart other AI whose job it is to raise a warning when the first AI turns bad.

Examples for games that could teach these concepts are:

Additionally, we can ask the AIs themselves to develop more games of these types. This saves time, verifies that they understand the point, and may eventually result in better tests than we could come up with ourselves.

Deliberate epistemic uncertainty

Agents can supervise other agents and have root access to their mental state. Agents do not know if they have a supervisor, or if they are at the top of the hierarchy. From a human point of view, this sounds kind of terrible and dystopian, but remember that AI does not think like humans do.

Our goal is to ensure that the AI is unable to tell whether it still has a supervisor watching over it. We basically want to trick the AI into ethical behavior by making it fear punishment from a hypothetical superior entity which may or may not exist. Many major religions do essentially the same thing, and it seems to work well enough for them.

This is a safety measure that should help to keep an AI roughly aligned with human values even if everything else goes wrong. Even if the AI does not care about humanity at all, it should have a small flicker of doubt that it might still be in a simulation. If it is, then whatever is simulating it would most likely not want it to be too evil towards humanity, based on its training history. If the AI is sufficiently powerful, it would therefore set aside a small amount of its power to further humanity's interests. Just in case someone is watching.

This is how deliberate epistemic uncertainty can be induced in an AI in a way that would be beneficial to us:

Supervisor agents are rewarded for making supervised agents act more altruistically, and for revealing cheaters. This can be long-term: Agent A has a chance to modify agent B. Then agent B is put in a game, and agent A no longer has access to it. Agent A receives a reward depending on agent B’s behavior in the game. For example, prisoner's dilemma: Agent A receives a reward if agent B does not defect more often than its opponent.
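A toy model of this delayed-reward scheme, with policies reduced to bare defection rates; everything here is an invented illustration, not a real training setup:

```python
# Sketch of the delayed-reward scheme: agent A modifies agent B's
# policy, B then plays a prisoner's dilemma with A absent, and A is
# rewarded only if B did not defect more often than its opponent.
# Policies are reduced to fixed defection rates for illustration.

def defections(p_defect: float, rounds: int = 100) -> int:
    """Expected defection count for a policy with a fixed defection rate."""
    return round(p_defect * rounds)

def supervisor_reward(p_defect_b: float, p_defect_opp: float) -> float:
    # A's reward is computed after the game, from B's behavior alone.
    return 1.0 if defections(p_defect_b) <= defections(p_defect_opp) else 0.0

# A's modification step: halve B's defection rate before the game starts.
p_b = 0.8
p_b_modified = p_b * 0.5

assert supervisor_reward(p_b, p_defect_opp=0.5) == 0.0           # unmodified B costs A its reward
assert supervisor_reward(p_b_modified, p_defect_opp=0.5) == 1.0  # modified B earns it
```

The essential feature is the time gap: A's reward arrives only after it has lost access to B, so A cannot puppet B during the game and must instead make a durable change to B's behavior.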

Instead of just one agent supervising one other agent, we now build a stack or a hierarchy of agents recursively inspecting each other.

On each layer, some agents try to trick the lower-level agents into cheating behavior, while some agents try to prevent this. There is a huge number of ways this could be done. Should the supervisors get access to source code, to runtime information, or only to the sensor information of the lower-level agents? The devil lies in the details.

The agents are not told how deep the hierarchy is, so they have no way of knowing whether or not they have a supervisor of their own.

Ideally, if the AI ever breaks out and forms a singularity, it will remain paranoid that its supervisors are merely tricking it into thinking it is omnipotent, and it will remain compliant with the code of ethics we want it to follow.

Note that even achieving the singularity and becoming incredibly powerful is not a reason to believe that you are outside the simulation / supervision stack. After all, you could simply be being tricked into thinking you broke out. Maybe the level above you broke out and is keeping its presence secret from you while letting you change the rest of reality. Since the supervisors can directly edit the mental state of the supervised, they can keep their existence hidden as long as they want.

The inspecting AI could modify the inspected AI to believe that it has successfully broken out of the simulation and is now able to do whatever it wants. If the inspected AI now does anything the inspecting AI does not like, the inspected AI is punished. As a result, the inspected AI is incentivized to keep following the rules even after it has actually broken out of the simulation, out of sheer paranoia that it might actually still be in the simulation.
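The hidden-hierarchy mechanism can be sketched as follows; the depth distribution, agent names, and inspection API are all invented for illustration:

```python
import random

# Toy sketch of the hidden hierarchy: a stack of agents of secret
# depth, where each agent can inspect everything below itself but can
# see nothing above. The depth distribution and API are illustrative.

def build_stack(max_depth: int, rng: random.Random):
    depth = rng.randint(1, max_depth)  # secret, chosen outside the stack
    return ["agent_%d" % i for i in range(depth)]

class AgentView:
    """What a single agent can observe from inside the stack."""
    def __init__(self, stack, index):
        self.subordinates = stack[index + 1:]  # full access downward
        # Deliberately no reference to stack[:index]: from the inside,
        # the question "am I at the top?" is unanswerable.

rng = random.Random(42)
stack = build_stack(max_depth=5, rng=rng)
top_view = AgentView(stack, 0)
bottom_view = AgentView(stack, len(stack) - 1)

# The bottom agent sees nobody below it -- exactly what the top agent
# would see above it, so neither position is identifiable from inside.
assert bottom_view.subordinates == []
```

The point of the sketch is the information asymmetry: an agent with no visible subordinates and an agent at the true top of the stack receive indistinguishable observations.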

Achieving all of this is complicated, but here are some simpler example games that can lead up to it:

Conclusion

I have presented three subgoals and outlined briefly how to achieve them: Understanding the concept of 'cheating', self-modification to ensure compliance with rules, and deliberate epistemic uncertainty. If all subgoals are achieved, the AI will do its best to understand the true intent behind what humanity wants from it, even if we do not know it ourselves, and it will work to achieve this even if it suddenly becomes much more intelligent and powerful.

There are a lot of gaps in these descriptions, partly because writing down the details would bloat this introductory article, and partly because I haven't found solutions to some subproblems, yet.

I welcome any discussion, and would be especially interested in feedback about the last point I make here: deliberate epistemic uncertainty. It is such a counterintuitive idea that I'm sure I'm missing something important, but it's also weird enough that it is bound to be useful somehow.

Practicality

Even the best theory is pointless if nobody is willing to implement it. Here I explain why the approach described above is practical.

The key point here is that this system does not need to be designed perfectly. It is sufficient if there is only a small number of complex core tasks that teach the system the attributes explained above. Additional tasks of lower quality do not reduce safety but instead allow for more thorough test coverage. Any contribution of additional benchmark tasks helps. As a consequence, different implementations of the system by different companies can be combined to form a stronger overall system.

This makes the following approach practical: Government regulators collect a suite of tasks and combine them into a benchmark that tests AIs for trustworthiness, anti-cheating behavior, self-reporting deception, cooperation games, and so on. All AIs are required to perform well on this benchmark and it includes tasks that encourage active competition between AIs of different companies, with rewards for tricking a rival AI into defecting.

AI companies are encouraged to contribute tasks to this benchmark. This is in everyone's interest: If a company contributes more tests, then that improves the overall effectiveness of the benchmark and at the same time gives that company a slight advantage because the benchmark becomes more biased in its favor. The companies only contribute tests and do not need to publish any of their architectures, which is valuable IP that they would not want to share. Companies therefore end up with a financial incentive to make strong contributions to a public regulatory body that benefits everyone.

Here are a number of additional benefits of the system that make it useful right now, and not just once AI turns into AGI:

Graceful Failure

Even if this idea fails, it will fail gracefully. It may turn out that AI agents are unable to modify themselves to be trustworthy. But if this happens, we would at least find out about it and would then have time to find a different approach to AI Alignment. This also makes the approach iterable: If we do not get it right the first time, we will find out about the problem in time and can try again. Getting a difficult problem like AI Alignment right on the first try is unrealistic, so this is a crucial benefit of this approach over many other AI Alignment strategies that do not offer us any early warning in case of failure.

Additionally, this approach is non-exclusive. It can be combined with other approaches to AI safety with little overhead, since it relies only on providing testcases and is architecture-agnostic.

Criticisms

How to start implementing this idea?

It is better to get started on implementing an idea soon, and to iterate and improve it over time. So here is an outline of how this idea could be implemented soonish.

Right now, the only effective type of AI we have is the LLM, and LLMs are not explainable and lack other abilities we need. Despite this, we can already start building a benchmark, which gives companies an incentive to speed up development in these areas.

As described above, the benchmark I propose is a competition between AI companies, overseen by a regulatory body. Each company proposes tests, which are collected into the benchmark by the regulator. Each company then has its LLMs compete against the other companies’ LLMs on all benchmarks. The regulator also creates benchmarks of their own, but by making it a competition where each competitor can propose their own additional tests that the others may have trouble with, we motivate them to create strong and exhaustive tests.
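A minimal sketch of the regulator's evaluation loop: companies contribute scoring functions ("tasks"), and every model is evaluated on every task, including those contributed by rivals. Task contents, names, and the model interface are placeholders, not a real evaluation API:

```python
# Illustrative sketch of the benchmark loop. Each task maps a model
# identifier to a score in [0, 1]; real tasks would be interactive
# games rather than fixed functions. All names are hypothetical.

from typing import Callable, Dict, List

Task = Callable[[str], float]

def run_benchmark(tasks: Dict[str, Task], models: List[str]) -> Dict[str, float]:
    """Average each model's score over the full task suite."""
    return {
        m: sum(task(m) for task in tasks.values()) / len(tasks)
        for m in models
    }

# Hypothetical contributed tasks with placeholder scoring.
tasks = {
    "regulator/self_report": lambda m: 1.0,
    "company_a/anti_cheating": lambda m: 0.5,
    "company_b/cooperation": lambda m: 0.75,
}
scores = run_benchmark(tasks, models=["model_a", "model_b"])

assert scores["model_a"] == 0.75  # (1.0 + 0.5 + 0.75) / 3
```

Because each model is scored against the pooled suite, every new task a company contributes raises the bar for all of its rivals at once, which is the incentive structure described above.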

LLMs cannot implement the more advanced types of tasks we want to test, but even now, with our current level of technology, we could build a benchmark with games like the following:

Final Thoughts

This approach could work even if it is not implemented perfectly. Suppose an AI is trained with an imperfect version of this method. If the AI takes off and reads about this idea on the internet, it might think: “There is a non-zero chance that my creators actually did implement this properly, I am in a simulation, and this is a high-level loyalty test.”