The Plan - 2023 Version, published by johnswentworth on December 30, 2023 on LessWrong.
Background: The Plan, The Plan: 2022 Update. If you haven't read those, don't worry, we're going to go through things from the top this year, and with moderately more detail than before.
1. What's Your Plan For AI Alignment?
Median happy trajectory:
Sort out our fundamental confusions about agency and abstraction enough to do interpretability that works and generalizes robustly.
Look through our AI's internal concepts for a good alignment target, then
Retarget the Search [1] (a toy sketch of this step follows the list).
…
Profit!
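To make the "Retarget the Search" step a bit more concrete, here is a deliberately toy sketch. This is my own illustration, not anything from the original posts: it assumes, hypothetically, that the trained system contains an explicit search routine steering toward an internally stored goal representation, that interpretability has located that representation, and that we can express a target we actually want in the same internal language. Retargeting then just means overwriting the goal while leaving the (capable) search machinery alone.

```python
# Toy model of "Retarget the Search" - an illustration, not the actual proposal.
import numpy as np

class ToySearcher:
    """A system whose behavior is hill-climbing its state toward an internal target."""
    def __init__(self, target: np.ndarray):
        self.target = target  # the internal "goal" representation

    def search(self, state: np.ndarray, steps: int = 100, lr: float = 0.1) -> np.ndarray:
        # The "capable search machinery": moves the state toward whatever the
        # internal target currently is.
        for _ in range(steps):
            state = state + lr * (self.target - state)
        return state

def retarget(system: ToySearcher, new_target: np.ndarray) -> None:
    # Leave the search machinery alone; overwrite only the goal it steers toward.
    system.target = new_target

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    system = ToySearcher(target=rng.normal(size=4))  # whatever goal training produced
    aligned_target = np.zeros(4)                     # stand-in for a target we chose
    retarget(system, aligned_target)
    print(system.search(rng.normal(size=4)).round(3))  # now converges toward the new target
```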
We'll talk about some other (very different) trajectories shortly.
A side-note on how I think about plans: I'm not really optimizing to make the plan happen. Rather, I think about many different "plans" as possible trajectories, and my optimization efforts are aimed at robust bottlenecks - subproblems which are bottlenecks on lots of different trajectories. An example from the linked post:
For instance, if I wanted to build a solid-state amplifier in 1940, I'd make sure I could build prototypes quickly (including with weird materials), and look for ways to visualize the fields, charge densities, and conductivity patterns produced. Whenever I saw "weird" results, I'd first figure out exactly which variables I needed to control to reproduce them, and of course measure everything I could (using those tools for visualizing fields, densities, etc). I'd also look for patterns among results, and look for models which unified lots of them.
Those are strategies which would be robustly useful for building solid-state amplifiers in many worlds, and likely directly address bottlenecks to progress in many worlds.
Main upshot of approaching planning this way: subproblems which are robust bottlenecks across many different trajectories we thought of are more likely to be bottlenecks on the trajectories we didn't think of - including the trajectory followed by the real world. In other words, this sort of planning is likely to result in actions which still make sense in hindsight, especially in areas with lots of uncertainty, even after the world has thrown lots of surprises at us.
2. So what exactly are the "robust bottlenecks" you're targeting?
For the past few years, understanding natural abstraction has been the main focus. Roughly speaking, the questions are: what structures in an environment will a wide variety of adaptive systems trained/evolved in that environment convergently use as internal concepts? When and why will that happen, how can we measure those structures, how will they be represented in trained/evolved systems, how can we detect their use in trained/evolved systems, etc.?
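The "how can we detect their use in trained/evolved systems" question can be made a little more concrete with a toy measurement. The sketch below is my own illustration, not part of the natural abstraction work itself: it uses linear CKA (a standard representation-similarity metric from Kornblith et al., 2019) to ask whether two separately trained systems, shown the same inputs, end up with similar internal activations. The random linear maps here are hypothetical stand-ins for real hidden layers.

```python
# Minimal sketch: compare two systems' activations on the same inputs with linear CKA.
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_inputs, n_features)."""
    X = X - X.mean(axis=0)  # center each feature
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    inputs = rng.normal(size=(500, 10))
    # Two "models" looking at the same environment; in a real experiment these
    # would be hidden-layer activations of separately trained networks.
    acts_a = inputs @ rng.normal(size=(10, 32))
    acts_b = inputs @ rng.normal(size=(10, 32))
    print(f"CKA(A, B) = {linear_cka(acts_a, acts_b):.3f}")
```

In an actual experiment the activation matrices would come from independently trained (or evolved) systems, and high similarity on environment-derived features would be one weak signal that some structure in the environment is being convergently learned.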
3. How is understanding abstraction a bottleneck to any alignment approach at all?
Well, the point of a robust bottleneck is that it shows up along many different paths, so let's talk through a few very different paths (which will probably be salient to very different readers). Just to set expectations: I do not expect that I can jam enough detail into one post that every reader will find their particular cruxes addressed. Or even most readers. But hopefully it will become clear why this "understanding abstraction is a robust bottleneck to alignment" claim is a thing a sane person might come to believe.
How is abstraction a bottleneck to alignment via interpretability?
For concreteness, we'll talk about a "retargeting the search"-style approach to using interpretability for alignment, though I expect the discussion in this section to generalize. It's roughly the plan sketched at the start of this post: do interpretability real good, look through the AI's internal concepts/language to figure out a good alignment target which we can express in that language, then write that target (in the AI's internal concept-language) into the ...