Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How to safely use an optimizer, published by Simon Fischer on March 29, 2024 on LessWrong.
Summary: The post describes a method that allows us to use an untrustworthy optimizer to find satisficing outputs.
Acknowledgements: Thanks to Benjamin Kolb (@benjaminko), Jobst Heitzig (@Jobst Heitzig) and Thomas Kehrenberg (@Thomas Kehrenberg) for many helpful comments.
Introduction
Imagine you have black-box access to a powerful but untrustworthy optimizing system, the Oracle. What do I mean by "powerful but untrustworthy"? I mean that, when you give an objective function f as input to the Oracle, it will output an element x that has an impressively low[1] value of f(x). But sadly, you have no guarantee that it will output the optimal element, and no guarantee that the output wasn't also chosen for a different purpose (which might be dangerous for many reasons, e.g. instrumental convergence).
What questions can you safely ask the Oracle? Can you use it to create utopia by asking for designs of machines, proteins, computer programs, etc.? Or are you risking the destruction of everything that we value if you dare to use such designs?
As an example, the Oracle could be a learned system; in that case, the topic of this post would be finding a way to get useful work out of the Oracle despite its inner misalignment.
In this post I'll describe a technique that allows us to safely use the Oracle under fairly weak assumptions. This approach can also be considered to be a way of controlling arbitrarily powerful AI systems.
One neat trick
This isn't fair, isn't fair, isn't fair! There's a limit to how many constraints you can add to a problem before it really is impossible!
(Harry Potter and the Methods of Rationality, Chapter 56)
Let O be a finite set of possible outputs of the Oracle (e.g. strings of length at most l) and f: O → ℝ be our objective function. Let's assume we are happy with an output that satisfices; i.e. we want to find an output x such that the value of f(x) is lower than some threshold c. Let S = {x ∈ O ∣ f(x) < c} denote the set of satisficing outputs.
We are worried about using the Oracle because it might be misaligned, i.e. it might be optimizing for a different objective function f′. Therefore, we fear there is a non-empty subset D ⊆ S of outputs that are dangerous to use because they were highly optimized for f′. But highly optimized outputs have to be rare[3] and thus we can safely assume that |D| ≪ |S|.
This has an important consequence: it's possible to randomly choose a subset R ⊆ O such that it's very likely that D ∩ R will be empty but S ∩ R will not! We can then modify our original objective function by adding the safety constraint that our output should be in R.[4] This makes the task we're giving to the Oracle harder, but it likely becomes impossible for the Oracle to find a dangerous satisficing output.
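To see why a suitable subset size makes this likely, here is a back-of-the-envelope sketch: if each output lands in R independently with probability 1/m, then E[|S ∩ R|] = |S|/m, while by a union bound P(D ∩ R ≠ ∅) ≤ |D|/m. The sizes below are illustrative assumptions, not numbers from the post.

```python
# Idealized sketch: R includes each output independently with probability 1/m.
# All sizes are made-up for illustration.
S_size = 10_000  # number of satisficing outputs, |S|
D_size = 10      # number of dangerous outputs, |D| << |S|
m = 1_000        # inverse inclusion probability for R

expected_safe_hits = S_size / m  # E[|S ∩ R|] = |S|/m
danger_bound = D_size / m        # P(D ∩ R nonempty) <= |D|/m (union bound)

print(expected_safe_hits)  # 10.0 safe candidates expected in R
print(danger_bound)        # 0.01 chance any dangerous output survives
```

With m chosen between |D| and |S|, R very likely still contains a satisficing output while excluding every dangerous one.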
Additionally, we can easily validate the output of the Oracle by checking that it satisfices and fulfills the safety constraint of being in R.
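As an illustration, this validation step is only a few lines; the objective f, threshold c, and hash h below are hypothetical stand-ins for the real ones, not anything specified in the post.

```python
# Hedged sketch of the validation step; f, c and h are toy stand-ins.
def make_validator(f, c, h):
    """Accept an Oracle output x only if it satisfices (f(x) < c)
    and fulfills the safety constraint x in R = {x | h(x) == 0}."""
    def validate(x):
        return f(x) < c and h(x) == 0
    return validate

# Toy example: objective is distance to 7, threshold 2,
# and "h(x) = x mod 3" stands in for a randomly chosen hash.
validate = make_validator(f=lambda x: abs(x - 7), c=2, h=lambda x: x % 3)
print(validate(6))  # True: f(6) = 1 < 2 and 6 % 3 == 0
print(validate(8))  # False: 8 % 3 != 0, so 8 is not in R
```

The point is that validation is cheap and fully under our control, so we never need to trust the Oracle's claim that an output is acceptable.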
Implementation
But how can we specify such a random subset? Inspired by the proof-of-work (PoW) algorithms used by blockchains, I suggest the following technique (which we might call "proof-of-optimization"):
Use a randomly chosen hash function h: O → [m], with [m] = {0, …, m−1} for m ∈ ℕ, and define R = {x ∈ O ∣ h(x) = 0}.[5]
To actually do this, we don't need cryptographic hash functions; a family of strongly universal hash functions has all the required properties.
A set of hash functions H = {h ∣ h: O → [m]} is called strongly universal if it maps the elements of O pairwise-independently and uniformly to [m], i.e.

P(h(x₁) = y₁ and h(x₂) = y₂) = 1/m² for all distinct x₁, x₂ ∈ O and all y₁, y₂ ∈ [m].

Here, h is sampled uniformly from H. The property of strong universality implies that a single element is mapped uniformly, i.e.

P(h(x) = y) = 1/m for all x ∈ O and y ∈ [m].
Such a set of hash functions exists and the functions can be efficiently computed.[6] The advantage of using strongly universal hash functions…
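As a concrete sketch, one classic construction of such a family (my choice for illustration; the post doesn't prescribe a particular one) is the Carter–Wegman family h(x) = ((a·x + b) mod p) mod m, with p prime and a, b sampled at random. It assumes outputs are encoded as integers below p, and it is only approximately uniform unless m divides p.

```python
import random

# A large Mersenne prime; assumes Oracle outputs encode as integers < P.
P = (1 << 61) - 1

def sample_hash(m, rng):
    """Sample h(x) = ((a*x + b) mod P) mod m from the Carter-Wegman
    family, which is (approximately) strongly universal."""
    a = rng.randrange(1, P)
    b = rng.randrange(P)
    return lambda x: ((a * x + b) % P) % m

rng = random.Random(0)  # seeded only so the sketch is reproducible
m = 1_000
h = sample_hash(m, rng)

def in_R(x):
    """The random safety constraint: accept x iff it lands in
    R = {x in O | h(x) == 0}, i.e. roughly a 1/m fraction of O."""
    return h(x) == 0
```

Sampling a fresh (a, b) pair picks a fresh random subset R, so the Oracle cannot have pre-optimized any dangerous output to land in it.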