Defining “Benevolence” in the context of Safe AI

[Note:  This is the first essay since this site was revamped.  I plan to import older writings when and as I get time.]

The question that motivates this essay is “Can we build a benevolent AI, and how do we get around the problem that humans, bless their cotton socks, can’t define ‘benevolence’?”

A lot of people want to emphasize just how many different definitions of “benevolence” there are in the world. The point, of course, is that humans are very far from agreeing on a universal definition of benevolence, so how can we expect to program something we cannot define into an AI?

There are many problems with this You-Just-Can’t-Define-It position. First, it ignores the huge common core that exists between different individual concepts of benevolence, because that common core is just … well, not worth attending to! We take it for granted. And ethical/moral philosophers are most guilty of all in this respect: the common core that we all tend to accept is just too boring to talk about or think about, so it gets neglected and forgotten. I believe that if you sat down to the (boring) task of cataloguing the common core, you would be astonished at how much there is.

The second problem is that the common core might (and I believe does) tend to converge as you move toward people who are at the “extremely well-informed” + “empathic” + “rational” end of the scale. In other words, it might well be the case that as you look at people who know more and more about the world, who show strong signs of empathy toward others, and who are able to reason rationally (i.e. are not fanatically religious), you find that the convergence toward a common core idea of benevolence becomes even stronger.

Lastly, when people say “You just can’t DEFINE good-vs-evil, or benevolence, or ethical standards, or virtuous behavior….” what they are referring to is the inability to create a closed-form definition of these things.

Closed-Form Definition

So what is a “closed-form definition”? It means something can be defined in such a way that the form of words fits into a dictionary entry whose size is no more than about a page, and which covers the cases so well that 99.99% of the meaning is captured. There seems to be an assumption that if no closed-form definition exists, then the thing itself does not exist. This is Definition Chauvinism. It is especially prevalent among people who are dedicated to the idea that AI is Logic; people who believe that meanings can be captured in logical propositions, and that the semantics of the atoms of a logical language can be captured in some kind of computable mapping of symbols to the world.

But even without a closed-form definition, it is possible for a concept to be captured in a large number of weak constraints. To people not familiar with neural nets I think this sometimes comes as a bit of a shock. I can build a simple backprop network that captures the spelling-to-sound correspondences of all the words in the English language, and in that network the hidden layer can have a pattern of activation that “defines” a particular word so uniquely that it is distinguished massively and completely from all other words. And yet, when you look at the individual neurons in that hidden layer, each one of them “means” something so vague as to be utterly undefinable. (Yes, neural net experts among you, it does depend on the number of units in the hidden layer and how it is trained, but with just the right choice of layer sizes the patterns can be made distributed in such a way that interpretations are pretty damned difficult.) In this case, the pronunciation of a given word can be considered to be “defined” by the sum of a couple of hundred factors, EACH OF WHICH is vague to the point of banality. Certainty, in other words, can come from amazingly vague inputs that are allowed to work together in a certain way.

Now, that “certain way” in which the vague inputs are combined is called “simultaneous weak constraint relaxation”. Works like a charm.

If you want another example, try this classic, courtesy of Geoff Hinton: “Tell me what X is, after I give you three facts about X, and if I tell you ahead of time that the three facts are not only vague, but also one of them (I won’t tell you which) is actually FALSE! So here they are:

(1) X was an actor.

(2) X was extremely intelligent.

(3) X was a president.”

(Most people compute what X is within a second. Which is pretty amazing, considering that X could have been anything in the whole universe, and given that this is such an incredibly lousy definition of it.)
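For readers who like to see this sort of thing run, here is a minimal sketch of the three-clue puzzle as a tiny relaxation network. It is a toy under stated assumptions, not a model of how people actually solve it: the four candidates, their feature strengths, and the parameters `rate` and `inhibition` are all invented for illustration. Each candidate unit is excited by the clues it weakly matches and inhibited by its rivals, and the activations are updated together until they settle.

```python
# Hypothetical candidates; every number below is invented for illustration.
CANDIDATES = {
    "actor_turned_president": {"actor": 0.9, "intelligent": 0.3, "president": 1.0},
    "career_actor":           {"actor": 1.0, "intelligent": 0.5, "president": 0.0},
    "scholar_president":      {"actor": 0.0, "intelligent": 0.9, "president": 1.0},
    "physicist":              {"actor": 0.0, "intelligent": 1.0, "president": 0.0},
}


def relax(clues, candidates=CANDIDATES, steps=200, rate=0.1, inhibition=2.0):
    """Settle candidate activations under the given clues; return a dict."""
    # Each clue is a weak constraint: it lends partial support to a candidate.
    support = {
        name: sum(features.get(clue, 0.0) for clue in clues)
        for name, features in candidates.items()
    }
    acts = {name: 0.0 for name in candidates}
    for _ in range(steps):
        total = sum(acts.values())
        new_acts = {}
        for name, a in acts.items():
            rivals = total - a  # summed activation of the competing candidates
            net = support[name] - inhibition * rivals - a
            # Nudge the unit toward its net input, clipped to the range [0, 1].
            new_acts[name] = min(1.0, max(0.0, a + rate * net))
        acts = new_acts  # all units are updated simultaneously
    return acts


if __name__ == "__main__":
    acts = relax(["actor", "intelligent", "president"])
    print(max(acts, key=acts.get))
```

No single clue picks out the answer, and one of them fits the winning candidate badly, yet in this toy setup the candidate with the most combined support ends up dominating once the rivals have suppressed one another.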

So what is the moral of that? Well, the closed-form definition of benevolence might not exist, in just the same way that there is virtually no way to produce a closed-form definition of how the pronunciation of a word relates to its spelling, if the “words” of the language you have to use are the hidden units of the network capturing the spelling-to-sound mapping. And yet, those “words” when combined in a weak constraint relaxation system allow the pronunciation to be uniquely specified. In just the same way, “benevolence” can be the result of a lot of subtle, hard-to-define factors, and it can be extraordinarily well-defined if that kind of “definition” is allowed.

Coming down to brass tacks, now, the practical implication of all this abstract theory is that if we built two different neural nets, each trained in a different way to pick up the various factors involved in benevolence, but each given a large enough data set, we might well find that even though the two nets have built up two completely different sets of wiring inside, and even though the training sets were not the same, they converge so closely that if they were tested on a million different “ethical questions”, they would only disagree on a handful of fringe cases. Note that I am only saying “this might happen”, at this stage, because the experiment has not been done ... but that kind of result is absolutely typical of weak constraint relaxation systems, so I would not be surprised if it worked exactly as described. So now, if we assume that that did happen, what would it mean to say that “benevolence is impossible to define”?

I submit that the assertion would mean nothing. It would be true of “define” in the sense of a closed-form dictionary definition. But it would be wrong and irrelevant in the context of the way that weak constraint systems define things.
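The two-nets thought experiment can be mimicked at toy scale. The sketch below does NOT run the experiment described above, and it does not use neural nets: as stand-ins I have swapped in two much simpler classifiers with different internals (a nearest-neighbour model and a nearest-centroid model), trained on disjoint halves of an invented data set. The two 0-to-9 scores per case, the “common core” rule, and the two idiosyncratic exceptions are all assumptions made up for illustration.

```python
# Toy stand-in for the two-nets thought experiment; all data here is invented.
CASES = [(x, y) for x in range(10) for y in range(10)]

# Hypothetical "common core" rule, plus two idiosyncratic fringe judgments.
EXCEPTIONS = {(5, 5): 0, (4, 5): 1}


def judgment(case):
    """Label a case 1 (acceptable) or 0, using the invented rule above."""
    if case in EXCEPTIONS:
        return EXCEPTIONS[case]
    return 1 if sum(case) >= 10 else 0


def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def nearest_neighbour(train):
    """Model A: answer with the label of the closest stored example."""
    def predict(case):
        return min(train, key=lambda item: dist2(item[0], case))[1]
    return predict


def nearest_centroid(train):
    """Model B: answer with the label whose average example is closest."""
    centroids = {}
    for label in (0, 1):
        members = [case for case, lab in train if lab == label]
        centroids[label] = tuple(
            sum(c[i] for c in members) / len(members) for i in range(2)
        )

    def predict(case):
        return min(centroids, key=lambda label: dist2(centroids[label], case))
    return predict


# Disjoint training sets: A sees only even-sum cases, B only odd-sum cases.
train_a = [(c, judgment(c)) for c in CASES if sum(c) % 2 == 0]
train_b = [(c, judgment(c)) for c in CASES if sum(c) % 2 == 1]
model_a = nearest_neighbour(train_a)
model_b = nearest_centroid(train_b)

# Test both models on every case and collect the ones where they differ.
disagreements = [c for c in CASES if model_a(c) != model_b(c)]
agreement = 1 - len(disagreements) / len(CASES)
print(f"agreement: {agreement:.2f}, disagreements: {disagreements}")
```

Neither model ever sees the other's training cases, and they store what they learn in quite different forms, yet in this toy setup any disagreement should be confined to cases sitting right next to the boundary. That proves nothing about real ethical judgments, of course; it only shows the shape of the claim.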

To be honest, I think this constant repetition of “benevolence is undefinable” is a distraction and a waste of our time.

You may not be able to define it. But I am willing to bet that a systems builder could nail down a constraint system that would agree with virtually all of the human common core decisions about what was consistent with benevolence and what was not.

And that, as they say, is good enough for government work.

:-)