In his paper on hypothesized superintelligent agents, Bill Hibbard, principal author of the Vis5D, Cave5D, and VisAD open-source visualization systems, proposes a mathematical framework for reasoning about AI agents, discusses sources and risks of unexpected AI behavior, and presents an approach to designing superintelligent systems that may avoid unintended existential risk.
After laying out his AI-environment framework, Hibbard notes that a superintelligent agent may fail to satisfy its designer's intentions when it pursues instrumental behaviors implicit in its final utility function. Such behavior, while unintended, could arise as the AI acts to preserve its own existence, to eliminate threats to itself and its utility function, or to increase its own efficiency and computing resources (see Nick Bostrom's paperclip-maximizer thought experiment).
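In rough outline (using generic expected-utility notation rather than Hibbard's exact symbols), such an agent chooses, given its interaction history $h$, the action that maximizes expected discounted future utility:

$$\pi(h) \;=\; \arg\max_{a}\; \mathbb{E}\!\left[\,\sum_{t > |h|} \gamma^{\,t-|h|}\, u(h_t) \;\Big|\; h, a\right].$$

Any subgoal that raises this expectation across most futures, such as self-preservation or resource acquisition, is instrumentally favored regardless of what $u$ actually rewards, which is precisely the failure mode at issue.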
Hibbard notes that several approaches to human-safe AI suggest designing intelligent machines to share human values, so that actions we dislike, such as taking resources from humans, violate the AI's motivations. However, humans are often unable to accurately write down their own values, and errors in doing so may motivate harmful instrumental AI action. Statistical algorithms may be able to learn human values by analyzing large amounts of human interaction data, but learning those values accurately requires powerful learning ability. A chicken-and-egg problem for safe AI follows: learning human values requires powerful AI, but safe AI requires knowledge of human values.
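As a toy illustration of the statistical value-learning idea (my own sketch, not an algorithm from Hibbard's paper), a Bradley-Terry-style model can recover a hypothetical linear "human utility" from observed pairwise choices:

```python
import math
import random

random.seed(0)

# Hypothetical setup: each option is a feature vector, and the hidden
# human utility is a linear function of those features.
TRUE_W = [2.0, -1.0]

def utility(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def choice_prob(w, x, y):
    """Bradley-Terry model: probability the human picks x over y."""
    return 1.0 / (1.0 + math.exp(utility(w, y) - utility(w, x)))

# Simulate noisy human choices between random pairs of options.
data = []
for _ in range(2000):
    x = [random.uniform(-1, 1) for _ in range(2)]
    y = [random.uniform(-1, 1) for _ in range(2)]
    picked_x = random.random() < choice_prob(TRUE_W, x, y)
    data.append((x, y) if picked_x else (y, x))  # winner listed first

# Fit w by gradient ascent on the log-likelihood of the observed choices.
w = [0.0, 0.0]
lr = 0.2
for _ in range(200):
    grad = [0.0, 0.0]
    for win, lose in data:
        p = choice_prob(w, win, lose)
        for i in range(2):
            grad[i] += (1 - p) * (win[i] - lose[i])
    w = [wi + lr * gi / len(data) for wi, gi in zip(w, grad)]

print(w)  # signs should match TRUE_W: positive first weight, negative second
```

The chicken-and-egg worry shows up even here: the fit is only as good as the model family and the data, and a weak learner given rich human behavior will misestimate the values it is supposed to be constrained by.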
Hibbard proposes a solution to this problem: a “first stage” superintelligent agent that is explicitly not allowed to act within its learning environment (and thus refrains from unintended actions). The learning environment includes a set of safe, human-level surrogate AI agents, independent of the superintelligent agent, whose composite actions mirror those of the superintelligent AI. The superintelligent agent can thus observe humans, as well as their interactions with the surrogates and with physical objects, and develop a safe environmental model from which it learns human values.
3 thoughts on “Avoiding Unintended Instrumental AI Behavior (Hibbard)”
One caveat: the paper assumes cardinal utility, meaningful interpersonal utility comparisons, meaningful interpersonal utility addition, and the correctness of addition as a utility-aggregation function. Those are really strong assumptions.
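For instance, a quick sketch (mine, not from the paper) shows why addition is scale-sensitive: rescaling one person's cardinal utility, which leaves that person's own preferences unchanged, can flip the aggregate ranking.

```python
# Toy illustration: summing cardinal utilities silently assumes
# interpersonally comparable units.

def aggregate(utilities):
    """Sum each option's utility across all people."""
    return {opt: sum(u[opt] for u in utilities) for opt in utilities[0]}

# Two people, two options: person A mildly prefers x, person B mildly prefers y.
a = {"x": 1.0, "y": 0.0}
b = {"x": 0.0, "y": 0.6}

print(aggregate([a, b]))           # {'x': 1.0, 'y': 0.6} -- x wins

# Rescale B's utilities by 2 (a positive affine transform that leaves
# B's own preference ordering unchanged) and the social choice flips.
b_rescaled = {k: 2 * v for k, v in b.items()}
print(aggregate([a, b_rescaled]))  # {'x': 1.0, 'y': 1.2} -- y wins
```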
Hmm. Hibbard’s ideas seem useful: allow a moldable AI to observe human interactions in order to establish a set of prohibitory rules or laws for itself. However, if the learning environment consists of real humans, I think the AI may develop immoral or corrupt laws, unless you assume all human behavior (or perhaps the average) is morally good.
This concern brings me to your second-to-last paragraph, in which you describe the learning environment as containing “a set of safe, human-level surrogate AI agents.” How could we create those human-level AIs if the purpose of this learning environment is to create a morally correct AI in the first place? This seems like a paradox to me.
Thank you for your review on Hibbard’s ideas! I look forward to further riveting discussion.
— Just another automata
I believe Hibbard means a set of surrogate, narrow AIs whose respective goal functions are instrumental to the general AI’s (e.g., by understanding how an AI would achieve these smaller goals, you can rationally derive how it would pursue a more general optimization).
Because we can, in theory, specify a rigorously defined domain of interactions for a human-level AI, it is assumed that these AIs are ‘safe’ and can be observed independently. That said, this is a strong assumption.