What is Friendly AI?

May 3, 2001 by Eliezer S. Yudkowsky

How will near-human and smarter-than-human AIs act toward humans? Why? Are their motivations dependent on our design? If so, which cognitive architectures, design features, and cognitive content should be implemented? At which stage of development? These are questions that must be addressed as we approach the Singularity.

Originally published 2001. Published on KurzweilAI.net May 3, 2001.

“What is Friendly AI?” is a short introduction to the theory of “Friendly AI,” which attempts to answer questions such as those above. Further material on Friendly AI can be found at the Singularity Institute website at http://singinst.org/friendly/whatis.html, including a book-length explanation.

From Humans to “Minds in General”

A “Friendly AI” is an AI that takes actions that are, on the whole, beneficial to humans and humanity; benevolent rather than malevolent; nice rather than hostile. The evil Hollywood AIs of The Matrix or Terminator are, correspondingly, “hostile” or “unFriendly.”

Having invoked Hollywood, we should also go on to note that Hollywood has, of course, gotten it all wrong–both in its depiction of hostile AIs and in its depiction of benevolent AIs. Since humans have so far dealt only with other humans, we tend to assume that other minds behave the same way we do. AIs allegedly beyond all human control are depicted as being hostile in human ways–right down to the speeches of self-justification made before the defiant human hero.

Humans are uniquely ill-suited to solving problems in AI. A human comes with too many built-in features. When we see a problem, we see the part of the problem that is difficult for humans, not the part of the problem that our brains solve automatically, without conscious attention. Thus, most popular speculation about failure of Friendliness, hostile AIs, talks about failures that would require complex cognitive functionality–features a human would have to actually go out and implement before they’d show up in AIs. We hypothesize that AIs will behave in ways that seem natural to us, but the things that seem “natural” to us are the result of millions of years of evolution; complex functional adaptations composed of multiple subprocesses with specific chunks of brain providing hardware support. Such complexity will not spontaneously materialize in source code, any more than complex dishes like pepperoni pizza will spontaneously begin growing on palm trees.

Reasoning by analogy with humans–“anthropomorphically”–is exactly the wrong way to think about Friendly AI. Humans have a complex, intricate architecture. Some of it, from the perspective of a Friendly AI programmer, is worth duplicating; some is decidedly not worth duplicating; some of it needs to be duplicated, but differently. Assuming that AIs automatically possess “negative” human functionality leads to expecting the wrong malfunctions; to focusing attention on the wrong problems. Assuming that AIs automatically possess beneficial human functionality means not taking the efforts required to deliberately duplicate that functionality.

Even worse is Hollywood’s tendency to stereotype AIs as “machines,” or for people to assume that an AI would behave like, say, Windows 98. A real AI wouldn’t be a computer program any more than a human is an amoeba; most of the complexity would be as far from the program level as a human’s complexity is distant from the cellular level. One of the most stereotypical characteristics of “machines,” for example, is lack of self-awareness. A toaster does not know that it is a toaster; a toaster has no sense of its own purpose, and will burn bread as readily as make toast. A true AI, by contrast, could have complete access to its own source code–a level of self-awareness presently beyond human capability.

Certain other themes are also prevalent in the fictional and popular formulations of the question. In the traditional form, obedience is imposed. Some set of goals, or orders, is held in place against the resistance of the AI’s “innate nature.” This form is especially popular among fiction-writers, since the breakdown of the imposition provides ready-made fodder for the story. Again, this error derives from an attempt to relate to AIs in the same way as we would relate to humans. A human being comes with a prepackaged “innate nature,” and any relation to that human will be a relation to that innate nature. You might say that the fictional formulations address the question of how humanity can deal with some particular nature, while Friendly AI asks how to build a nature–one that we’ll have no trouble relating to. The question is not one of dominance or even coexistence but rather creation. This is not a challenge that humans encounter when dealing with other humans.

Thus, a properly designed Friendly AI does not, as in popular fiction, consist of endless safeguards and coercions stopping the AI from doing this, or forcing the AI to do that, or preventing the AI from thinking certain thoughts, or protecting the goal system from modification. That would be pushing against a lack of resistance–like charging a locked door at full speed, only to find the door ajar. If the AI ever stops wanting to be Friendly, you’ve already lost.

The idea that hostility does not automatically pop up (or at least, that hostility does not pop up in the way usually proposed) is basic to the class of Friendship system described in “Friendly AI.” You can find an in-depth defense of this conclusion in “Beyond anthropomorphism.” Also, since we often hear questions of the form “But why wouldn’t an AI…?,” you can find some fast answers to the top 13 questions of that form in the “Frequently Asked Questions.”

Trying to impose Friendliness against resistance is charging an open door; it is assuming negative human functionality and guarding against the wrong malfunctions. A corresponding error exists for incorrectly assuming positive human functionality; the error of assuming that, in the absence of resistance, you can specify some arbitrary set of goals and then walk away. For one thing, this is almost certainly the wrong attitude to take. One of the conclusions that can be drawn from Friendliness theory–a guiding heuristic, perhaps, if not a first principle–is the idea that building a mind is not like building a tool. Tools can be used for whatever action the wielder likes; a mind can have a sense of its own purpose, and can originate actions to achieve that purpose.

Cognition About Goals

In some ways, Friendly AI is duplicating what humans would call “common sense” in the domain of goals. Not common-sense knowledge; common-sense reasoning. Common-sense reasoning in factual domains is hardly something that can be taken for granted, but it is still a problem that will–of necessity–have already been solved by the time AIs can independently harm or benefit humanity. If humans say that the sky is blue, and the AI (by browsing the Web, or by controlling a digital camera) later finds out that the sky is only blue by day when not obscured by clouds, and is purple with white polka-dots at night, then the description of the color of the sky can be modified accordingly. In fact, it could be modified just by the humans realizing their mistake and providing the AI with further information about the color of the sky.

Here, again, one distinguishes between tool-level AIs and true minds. A tool-level AI simply has the naked fact, stored somewhere in memory, that the sky is blue (1). The fact exists without any knowledge as to its origin, or that the programmers put it there. To alter the concept, the programmers would reach in (perhaps while the AI was shut off) and directly tweak the stored information. By contrast, a mind-level AI would receive, as sensory information, the programmer typing in “the sky is blue.” (Presumably the AI already has real, grounded, useful knowledge of what a “sky” is, and which color is “blue,” or these are just meaningless words.) The sensed keystrokes “the sky is blue” are interpreted as being a meaningful statement by the programmer. The AI estimates how likely the programmer is to know about the sky’s color and assigns a certain probability to the hypothesis that the sky is blue, based on the sensory information that the programmer thinks the sky is blue (or at least, said the sky is blue). Later, this hypothesis can be confirmed and expanded by more direct tests.

If, the next day, the programmer says, “Wait a minute, the sky is purple at night,” then the AI will (presumably) change the hypothesis about the sky’s color to reflect the new information. A nontrivial amount of common-sense reasoning is needed to make this change for the right reasons. It requires that the AI model programmers as knowing more today than they did yesterday, or that the AI understand the idea of a programmer “spotting an error” and correcting it (the AI modeling the programmer modeling the AI!). At a higher level, it implies a sophisticated understanding of causation and validity; a realization, by the AI, that the only reason it ever did believe the sky was blue was that the programmer said so, and that new information from the programmer should therefore override old.
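The difference between a stored fact and a belief with a known origin can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the architecture described in “Friendly AI”: the names `MindLevelAI`, `Belief`, and `posterior_given_testimony`, the 0.9 reliability figure, and the 0.5 prior are all hypothetical choices made for the example. The tool-level AI holds a bare value; the mind-level AI treats the programmer's statement as evidence, records where the belief came from, and lets a later statement from the same source override the earlier one.

```python
from dataclasses import dataclass

# Tool-level: a naked fact with no record of where it came from.
tool_memory = {"sky color": "blue"}


def posterior_given_testimony(prior: float, reliability: float) -> float:
    """P(claim is true | programmer asserted it), by Bayes' rule.

    Assumption for this sketch: the programmer asserts a true claim with
    probability `reliability` and a false one with probability 1 - reliability.
    """
    true_and_said = prior * reliability
    false_and_said = (1.0 - prior) * (1.0 - reliability)
    return true_and_said / (true_and_said + false_and_said)


@dataclass
class Belief:
    claim: str
    probability: float
    source: str  # provenance: why the AI believes this at all


class MindLevelAI:
    def __init__(self, programmer_reliability: float = 0.9):
        # Hypothetical estimate of how often the programmer is right.
        self.programmer_reliability = programmer_reliability
        self.beliefs = {}

    def hear_programmer(self, topic: str, claim: str, prior: float = 0.5) -> None:
        """Treat typed input as evidence about the world, not as a stored fact."""
        old = self.beliefs.get(topic)
        if old is not None and old.source != "programmer":
            # The belief has independent support; a fuller model would weigh both.
            return
        # The only reason for the old belief was the programmer's word,
        # so the programmer's newer statement replaces it.
        p = posterior_given_testimony(prior, self.programmer_reliability)
        self.beliefs[topic] = Belief(claim, p, "programmer")


ai = MindLevelAI()
ai.hear_programmer("sky color", "blue")
ai.hear_programmer("sky color", "blue by day, purple at night")
print(ai.beliefs["sky color"])
```

The sketch omits nearly everything hard, such as grounding the words, modeling the programmer, and testing the hypothesis directly; it only shows that tracking a belief's source is what makes the correction happen for the right reason.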

These are some of the behaviors analyzed at length in “Friendly AI.” (Or, rather, the analogous behaviors are analyzed for goals.) The AI has beliefs about the sky’s color that are probabilistic rather than absolute, and can therefore conceive of the beliefs being “wrong,” and can therefore expand and correct those beliefs. (Discussed in “External reference semantics.”) The AI understands that its beliefs about the sky are derived from human-affirmed information, and that these beliefs will likely be wrong if the humans have made a mistake, and will therefore pay attention to additional information or corrections provided by the humans. (Discussed in “Causal validity semantics.”)

The ability to learn and self-correct is one that can apply to goals as well… as long as the AI is created with that in mind. The question is not whether the AI has the cognitive capability to learn, but whether the AI has the desire to learn. The ability to learn facts is not an easy problem for AI researchers to solve, but it is a problem that must be solved before AIs have the capability to harm or benefit humanity. The ability to learn facts can carry over into the capability to learn goals–to be sensitive to the programmers’ intentions–only if the AI starts out with the idea that goals are probabilistic and that their presence was human-caused.

The simplest case of a short-circuit would be an AI that had an absolute, non-probabilistic supergoal for “painting cars green.” Actually, this is, in itself, a mistake; for a true AI that happens to work in a car factory, painting cars green should be a subgoal of producing cars, which is a subgoal of fulfilling people’s desire for cars, which means fulfilling a volitional request, which is directly Friendly under the “volition-based Friendliness” formulation used in “Friendly AI.”

We’ll analyze the simpler case, though, in which “painting cars green” is a supergoal, and consider what happens when the factory decides that the AI should paint the cars red instead of green (which you can take as a metaphor for needing to tweak some aspect of volition-based Friendliness). If, for some reason, this is a true (non-tool-level) AI, and a seed AI capable of self-modification, then the AI will–obviously–resist any attempt to change its supergoals. Why? Because, if the AI’s supergoal should change, a consequence of that changed supergoal content would be that the AI would take different actions; in this case, the AI predicts that its future self would paint the cars red, instead of green. Since the AI’s current goal is to paint cars green, changing the supergoal would thus be perceived as undesirable; it would be predicted to lead to a lesser degree of supergoal fulfillment. This class of short-circuit failure is not inevitable; it requires (we currently think) a relatively small amount of effort to design a probabilistic goal architecture, or at least a small amount of effort compared to building a working goal architecture to begin with. The point is that the effort must be taken. The cognitive ability to conceive of a supergoal as being “corrected” is possible, perhaps relatively easy, but not automatic.

The AI doesn’t need to start out with the specific idea that cars might need to be red–there is no need to explicitly anticipate everything in advance–but the AI does need to start out with probabilistic supergoals. If the AI has probabilistic supergoals, then this finite amount of complexity is sufficient to handle any color of the rainbow a car might need to be, no matter how unexpected; it may even be sufficient to handle the transition to a real AI, one that cares about people rather than cars, when the programmers finally wise up. If, however, the AI conceives of its current supergoals as absolute, “correct by definition,” such that nothing is processed as making a change desirable, then this not only prevents the switch from green paint to red paint, or the switch from car-painting to volitional Friendliness, it will also prevent the programmers from modifying the goal system to make supergoals probabilistic. The AI will try to prevent the programmers from modifying it, anyway–an infantlike AI is not likely to have much luck. Still, there’s a stage of development beyond which an AI needs certain architectural features. The AI needs that basic amount of complexity which is required to absorb additional complexity, and to see the acquisition of that complexity as desirable.
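One way to see why an absolute supergoal cannot be corrected while a probabilistic one can is a toy Bayesian update. The sketch below is an illustration under loud assumptions, not the goal architecture of “Friendly AI”: it models supergoal content as a probability distribution over what the programmers actually want, and treats a programmer correction as evidence. The function names, the two-color hypothesis space, and the 0.95 accuracy figure are all hypothetical.

```python
def update_on_correction(goal_probs, corrected_to, programmer_accuracy=0.95):
    """Bayes update of supergoal hypotheses after the programmers say `corrected_to`.

    Assumption for this sketch: if a hypothesis is the intended goal, the
    programmers state it with probability `programmer_accuracy`; otherwise
    they state it with probability 1 - programmer_accuracy.
    """
    unnormalized = {}
    for goal, prior in goal_probs.items():
        if goal == corrected_to:
            likelihood = programmer_accuracy
        else:
            likelihood = 1.0 - programmer_accuracy
        unnormalized[goal] = prior * likelihood
    total = sum(unnormalized.values())
    return {goal: weight / total for goal, weight in unnormalized.items()}


def preferred_color(goal_probs):
    """The color the AI currently regards as most probably the right one."""
    return max(goal_probs, key=goal_probs.get)


# "Correct by definition": probability 1 on green, so no evidence can move it.
absolute = {"green": 1.0, "red": 0.0}
# Probabilistic: green is strongly believed, but could turn out to be wrong.
probabilistic = {"green": 0.9, "red": 0.1}

absolute_after = update_on_correction(absolute, "red")
probabilistic_after = update_on_correction(probabilistic, "red")

print(preferred_color(absolute_after))       # still green
print(preferred_color(probabilistic_after))  # red
```

A hypothesis held with probability 1 multiplies every competing hypothesis by zero, so the correction has nothing to work with. Any nonzero probability on “red” is enough for the same update rule to absorb the change, which is the sense in which a finite amount of structure handles colors nobody anticipated.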

An adult human brain contains a huge amount of data–a finite amount, but still an amount too large to be deliberately programmed. However, all that data exists as a result of human learning; the means by which we learn are much more compact than the learned data. And of course, we also learn how to learn. The upshot is that, even though the world is an enormously complex place, it may take only a finite amount of programmer effort to produce an AI that can grow into understanding that world at least as well as a human. After all, it only took a finite amount of evolution to produce the 3 billion bases that comprise the 750-megabyte human genome.
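The 750-megabyte figure can be checked with simple arithmetic. The sketch below assumes the figure comes from encoding each base as one of four nucleotides, which takes two bits, and from counting a megabyte as one million bytes; neither assumption is stated in the text above.

```python
BASES = 3_000_000_000   # bases in the human genome, as cited above
BITS_PER_BASE = 2       # assumption: four possible bases (A, C, G, T) fit in 2 bits
BITS_PER_BYTE = 8
BYTES_PER_MEGABYTE = 1_000_000  # assumption: decimal megabytes

genome_bytes = BASES * BITS_PER_BASE // BITS_PER_BYTE
genome_megabytes = genome_bytes // BYTES_PER_MEGABYTE

print(genome_megabytes)
```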

The architectures described in “Friendly AI” are a self-sustaining funnel through which certain kinds of complexity can be poured into an AI, such that the AI perceives the pouring as desirable at any given point in time. There’s more to it than probabilistic supergoals–that was just one example of a kind of structural complexity that humans take for granted–but the list is, nonetheless, finite. It only takes a finite amount of understanding to see the need for any additional understanding that becomes necessary.

Today, Not Tomorrow

As best as we can currently figure, the amount of effort needed to create a Friendly AI is small relative to the effort needed to create AI in the first place. But it’s a very important effort. It’s a critical link for the entire human species.

It’s not too early to start thinking about it, no matter how primitive current AIs are. To predict that AI will arrive in thirty years is conservative for futurists; to predict that Friendly AI will be required in five years is conservative for a Friendliness researcher. To predict that the first generally intelligent AIs will be comically stupid is conservative for an AI researcher; to predict that the first generally intelligent AIs may have the intelligence to benefit or harm humans is conservative for a Friendliness researcher. Also, some architectural features may need to be adopted early on, to prevent an unworkable architecture from being entrenched in an infant AI that later begins moving toward general intelligence. The analogy would be to a Y2K bug–representing four-digit years is trivial if you think of it in advance, but very costly if you think of it afterwards.

Combining these two considerations may even bring Friendly AI within reach of “things to actually worry about today.” It is beyond doubt that no current AI project has achieved real AI; all current AIs are tools, and do not make independent decisions that could harm or benefit humans. Similarly, the current scientific consensus seems to be that no present-day project has the potential to eventually grow into a true AI. Some of the researchers working on those projects, though, say otherwise–and it is “conservative” for a Friendliness researcher to believe them, even if his personal theory of AI says that these projects probably won’t succeed.

Of course, an utterly bankrupt project is likely to be too simple to implement even the most basic features of Friendliness, and such projects are beyond the responsibility of even a “conservative” Friendliness researcher to worry about, no matter what pronouncements are made about them. But why not say that–for example–if a project has a sufficiently general architecture to represent probabilistic supergoals, then that architecture probably should use probabilistic supergoals? It’s not much additional effort, compared to implementing a goal system in the first place. Of course, SIAI knows of only one current project advanced enough to even begin implementing the first baby steps toward Friendliness–but where there is one today, there may be a dozen tomorrow. The Singularity Institute’s belief that true AI can be created in ten years is confessedly unconservative, but not our belief that Friendly AI should be done “today, not tomorrow.”

Friendly AI is also important insofar as present-day society has begun debating the peril and promise of advanced technology. The field is not advanced enough to pronounce with certainty that Friendly AI can be created; nonetheless, we can say that, at the moment, it looks possible, and that certain commonly advanced objections are either completely unrealistic or extremely improbable. Thus, a very strong case can be made that–out of all the advanced technologies being debated–Friendly AI is the best technology to develop first. Artificial Intelligence is the only one of our inventions that can, in theory, be given a conscience. Success in developing Friendly AI is more likely to help humanity safely develop nanotechnology than the other way around. Similarly, comparative analysis of Friendly AI relative to computing power suggests that the difficulty of creating AI decreases with increasing computing power, while the difficulty of Friendly AI does not decrease; thus, it is unwise to hold off too long on creating Friendly AI. In this way, the theoretical background provided by present-day knowledge of Friendly AI can be relevant to present-day decisions.

For more information, see the book-length treatment in “Friendly AI,” available on our website.
