Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How to safely use an optimizer, published by Simon Fischer on March 29, 2024 on LessWrong.
Summary: The post describes a method that allows us to use an untrustworthy optimizer to find satisficing outputs.
Acknowledgements: Thanks to Benjamin Kolb (@benjaminko), Jobst Heitzig (@Jobst Heitzig) and Thomas Kehrenberg (@Thomas Kehrenberg) for many helpful comments.
Introduction
Imagine you have black-box access to a powerful but untrustworthy optimizing system, the Oracle. What do I mean by "powerful but untrustworthy"? I mean that, when you give an objective function f as input to the Oracle, it will output an element x that has an impressively low[1] value of f(x). But sadly, you have no guarantee that it will output the optimal element, and no guarantee that the output wasn't also chosen for a different purpose (which might be dangerous for many reasons, e.g. instrumental convergence).
What questions can you safely ask the Oracle? Can you use it to create utopia by asking for designs of machines, proteins, computer programs, etc.? Or are you risking the destruction of everything that we value if you dare to use such designs?
As an example, the Oracle could be a learned system; in that case, the topic of this post would be finding a way to get useful work out of the Oracle despite its inner misalignment.
In this post I'll describe a technique that allows us to safely use the Oracle under fairly weak assumptions. This approach can also be considered to be a way of controlling arbitrarily powerful AI systems.
One neat trick
This isn't fair, isn't fair, isn't fair! There's a limit to how many constraints you can add to a problem before it really is impossible!
(Harry Potter and the Methods of Rationality, Chapter 56)
Let O be a finite set of possible outputs of the Oracle (e.g. strings of length at most l) and f: O → ℝ be our objective function. Let's assume we are happy with an output that satisfices; i.e. we want to find an output x such that the value of f(x) is lower than some threshold c. Let S = {x ∈ O ∣ f(x) < c} denote the set of satisficing outputs.
We are worried about using the Oracle because it might be misaligned, i.e. it might be optimizing for a different objective function f′. Therefore, we fear there is a non-empty subset D ⊆ S of outputs that are dangerous to use because they were highly optimized for f′. But highly optimized outputs have to be rare[3] and thus we can safely assume that |D| ≪ |S|.
This has an important consequence: it's possible to randomly choose a subset R ⊆ O such that it's very likely that D ∩ R will be empty but S ∩ R will not! We can then modify our original objective function by adding the safety constraint that our output should be in R.[4] This makes the task we're giving to the Oracle harder, but it likely becomes impossible for the Oracle to find a dangerous satisficing output.
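To see why a suitable subset size makes this likely, here is a back-of-the-envelope sketch: if each output lands in R independently with probability 1/m, then E[|S ∩ R|] = |S|/m, while by a union bound P(D ∩ R ≠ ∅) ≤ |D|/m. The sizes below are illustrative assumptions, not numbers from the post.

```python
# Idealized sketch: R includes each output independently with probability 1/m.
# All sizes are made-up for illustration.
S_size = 10_000  # number of satisficing outputs, |S|
D_size = 10      # number of dangerous outputs, |D| << |S|
m = 1_000        # inverse inclusion probability for R

expected_safe_hits = S_size / m  # E[|S ∩ R|] = |S|/m
danger_bound = D_size / m        # P(D ∩ R nonempty) <= |D|/m (union bound)

print(expected_safe_hits)  # 10.0 safe candidates expected in R
print(danger_bound)        # 0.01 chance any dangerous output survives
```

With m chosen between |D| and |S|, R very likely still contains a satisficing output while excluding every dangerous one.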
Additionally, we can easily validate the output of the Oracle by checking that it satisfices and fulfills the safety constraint of being in R.
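As an illustration, this validation step is only a few lines; the objective f, threshold c, and hash h below are hypothetical stand-ins for the real ones, not anything specified in the post.

```python
# Hedged sketch of the validation step; f, c and h are toy stand-ins.
def make_validator(f, c, h):
    """Accept an Oracle output x only if it satisfices (f(x) < c)
    and fulfills the safety constraint x in R = {x | h(x) == 0}."""
    def validate(x):
        return f(x) < c and h(x) == 0
    return validate

# Toy example: objective is distance to 7, threshold 2,
# and "h(x) = x mod 3" stands in for a randomly chosen hash.
validate = make_validator(f=lambda x: abs(x - 7), c=2, h=lambda x: x % 3)
print(validate(6))  # True: f(6) = 1 < 2 and 6 % 3 == 0
print(validate(8))  # False: 8 % 3 != 0, so 8 is not in R
```

The point is that validation is cheap and fully under our control, so we never need to trust the Oracle's claim that an output is acceptable.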
Implementation
But how can we specify such a random subset? Inspired by the proof-of-work (PoW) algorithms used by blockchains, I suggest the following technique (which we might call "proof-of-optimization"):
Use a randomly chosen hash function h: O → [m], with [m] = {0, …, m−1} for m ∈ ℕ, and define R = {x ∈ O ∣ h(x) = 0}.[5]
To actually do this, we don't need cryptographic hash functions; a family of strongly universal hash functions has all the required properties.
A set of hash functions H = {h ∣ h: O → [m]} is called strongly universal if it maps the elements of O pairwise-independently and uniformly to [m], i.e.

P(h(x₁) = y₁ and h(x₂) = y₂) = 1/m² for all distinct x₁, x₂ ∈ O and all y₁, y₂ ∈ [m].

Here, h is sampled uniformly from H. The property of strong universality implies that a single element is mapped uniformly, i.e.

P(h(x) = y) = 1/m for all x ∈ O and y ∈ [m].
Such a set of hash functions exists and the functions can be efficiently computed.[6] The advantage of using strongly universal hash functions…
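As a concrete sketch, one classic construction of such a family (my choice for illustration; the post doesn't prescribe a particular one) is the Carter–Wegman family h(x) = ((a·x + b) mod p) mod m, with p prime and a, b sampled at random. It assumes outputs are encoded as integers below p, and it is only approximately uniform unless m divides p.

```python
import random

# A large Mersenne prime; assumes Oracle outputs encode as integers < P.
P = (1 << 61) - 1

def sample_hash(m, rng):
    """Sample h(x) = ((a*x + b) mod P) mod m from the Carter-Wegman
    family, which is (approximately) strongly universal."""
    a = rng.randrange(1, P)
    b = rng.randrange(P)
    return lambda x: ((a * x + b) % P) % m

rng = random.Random(0)  # seeded only so the sketch is reproducible
m = 1_000
h = sample_hash(m, rng)

def in_R(x):
    """The random safety constraint: accept x iff it lands in
    R = {x in O | h(x) == 0}, i.e. roughly a 1/m fraction of O."""
    return h(x) == 0
```

Sampling a fresh (a, b) pair picks a fresh random subset R, so the Oracle cannot have pre-optimized any dangerous output to land in it.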