Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: An even deeper atheism, published by Joe Carlsmith on January 11, 2024 on LessWrong.
(Cross-posted from my website. Podcast version here, or search for "Joe Carlsmith Audio" on your podcast app.
This essay is part of a series I'm calling "Otherness and control in the age of AGI." I'm hoping that individual essays can be read fairly well on their own, but see here for brief summaries of the essays that have been released thus far.
Minor spoilers for Game of Thrones.)
In my last essay, I discussed Robin Hanson's critique of the AI risk discourse - and in particular, the accusation that this discourse "others" the AIs, and seeks too much control over the values that steer the future. I find some aspects of Hanson's critique uncompelling and implausible, but I do think he's pointing at a real discomfort.
In fact, I think that when we bring certain other Yudkowskian vibes into view - and in particular, vibes related to the "fragility of value," "extremal Goodhart," and "the tails come apart" - this discomfort should deepen yet further. In this essay I explain why.
The fragility of value
Engaging with Yudkowsky's work, I think it's easy to take away something like the following broad lesson: "extreme optimization for a slightly-wrong utility function tends to lead to valueless/horrible places."
Thus, in justifying his claim that "any Future not shaped by a goal system with detailed reliable inheritance from human morals and metamorals, will contain almost nothing of worth," Yudkowsky argues that value is "fragile."
There is more than one dimension of human value, where if just that one thing is lost, the Future becomes null. A single blow and all value shatters. Not every single blow will shatter all value - but more than one possible "single blow" will do so.
For example, he suggests: suppose you get rid of boredom, and so spend eternity "replaying a single highly optimized experience, over and over and over again." Or suppose you get rid of "contact with reality," and so put people into experience machines. Or suppose you get rid of consciousness, and so make a future of non-sentient flourishing.
Now, as Katja Grace points out, these are all pretty specific sorts of "slightly different."[1] But at times, at least, Yudkowsky seems to suggest that the point generalizes to many directions of subtle permutation: "if you have a 1000-byte exact specification of worthwhile happiness, and you begin to mutate it, the value created by the corresponding AI with the mutated definition falls off rapidly."
(Image: ChatGPT imagines "slightly mutated happiness.")
Can we give some sort of formal argument for expecting value fragility of this kind? The closest I've seen is the literature on "extremal Goodhart" - a specific variant of Goodhart's law (Yudkowsky gives his description here).[2] Imprecisely, I think the thought would be something like: even if the True Utility Function is similar enough to the Slightly-Wrong Utility Function to be correlated within a restricted search space, extreme optimization searches much harder over a much larger space - and within that much larger space, the correlation between the True Utility and the Slightly-Wrong Utility breaks down, such that getting maximal Slightly-Wrong Utility is no update about the True Utility. Rather, conditional on maximal Slightly-Wrong Utility, you should expect the mean True Utility for a random point in the space. And if you're bored, in expectation, by a random point in the space (as Yudkowsky is, for example, by a random arrangement of matter and energy in the lightcone), then you'll be disappointed by the results of extreme but Slightly-Wrong optimization.
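Here is a minimal sketch of that picture in code. The toy model is an illustrative assumption of mine rather than anything from the Goodhart literature or this essay: True Utility is light-tailed, and the Slightly-Wrong proxy adds independent heavy-tailed error, so the two correlate among typical points while the proxy's extremes are dominated by the error term.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_trial(n_points):
    # True Utility is light-tailed (Gaussian); the Slightly-Wrong proxy adds an
    # independent heavy-tailed (lognormal) error. Among typical points the proxy
    # tracks True Utility; at the proxy's extremes, the error term dominates.
    true_utility = rng.normal(size=n_points)
    proxy = true_utility + rng.lognormal(mean=0.0, sigma=1.0, size=n_points)
    best_by_proxy = proxy.argmax()  # "extreme optimization" of the proxy
    corr = np.corrcoef(true_utility, proxy)[0, 1]
    return true_utility[best_by_proxy], true_utility.max(), corr

# Average over repeated trials so the single-point selection isn't just noise.
results = np.array([run_trial(1_000_000) for _ in range(50)])
picked, best_available, corr = results.mean(axis=0)

print(f"proxy/True-Utility correlation among random points: {corr:.2f}")
print("mean True Utility of a random point:                 0.00 (by construction)")
print(f"True Utility of the proxy-maximizing point:          {picked:.2f}")
print(f"best True Utility available in the space:            {best_available:.2f}")
```

On this toy model, the proxy-maximizing point out of a million candidates ends up with roughly the True Utility of a random draw, even though far better points exist in the space - which is the "no update" behavior the argument above describes.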
Now, this is not, in itself, any kind of airtight argument that any utility function subject to extreme and unchecked optimization pressure has to be exactly right.
But ami...