The Plan - 2023 Version, published by johnswentworth on December 30, 2023 on LessWrong.
Background: The Plan, The Plan: 2022 Update. If you haven't read those, don't worry, we're going to go through things from the top this year, and with moderately more detail than before.
1. What's Your Plan For AI Alignment?
Median happy trajectory:
Sort out our fundamental confusions about agency and abstraction enough to do interpretability that works and generalizes robustly.
Look through our AI's internal concepts for a good alignment target, then
Retarget the Search [1] (a toy sketch of this step follows the list).
…
Profit!
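To make the "Retarget the Search" step a bit more concrete, here is a deliberately toy sketch. This is my own illustration, not anything from the original posts: it assumes, hypothetically, that the trained system contains an explicit search routine steering toward an internally stored goal representation, that interpretability has located that representation, and that we can express a target we actually want in the same internal language. Retargeting then just means overwriting the goal while leaving the (capable) search machinery alone.

```python
# Toy model of "Retarget the Search" - an illustration, not the actual proposal.
import numpy as np

class ToySearcher:
    """A system whose behavior is hill-climbing its state toward an internal target."""
    def __init__(self, target: np.ndarray):
        self.target = target  # the internal "goal" representation

    def search(self, state: np.ndarray, steps: int = 100, lr: float = 0.1) -> np.ndarray:
        # The "capable search machinery": moves the state toward whatever the
        # internal target currently is.
        for _ in range(steps):
            state = state + lr * (self.target - state)
        return state

def retarget(system: ToySearcher, new_target: np.ndarray) -> None:
    # Leave the search machinery alone; overwrite only the goal it steers toward.
    system.target = new_target

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    system = ToySearcher(target=rng.normal(size=4))  # whatever goal training produced
    aligned_target = np.zeros(4)                     # stand-in for a target we chose
    retarget(system, aligned_target)
    print(system.search(rng.normal(size=4)).round(3))  # now converges toward the new target
```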
We'll talk about some other (very different) trajectories shortly.
A side-note on how I think about plans: I'm not really optimizing to make the plan happen. Rather, I think about many different "plans" as possible trajectories, and my optimization efforts are aimed at robust bottlenecks - subproblems which are bottlenecks on lots of different trajectories. An example from the linked post:
For instance, if I wanted to build a solid-state amplifier in 1940, I'd make sure I could build prototypes quickly (including with weird materials), and look for ways to visualize the fields, charge densities, and conductivity patterns produced. Whenever I saw "weird" results, I'd first figure out exactly which variables I needed to control to reproduce them, and of course measure everything I could (using those tools for visualizing fields, densities, etc). I'd also look for patterns among results, and look for models which unified lots of them.
Those are strategies which would be robustly useful for building solid-state amplifiers in many worlds, and likely directly address bottlenecks to progress in many worlds.
Main upshot of approaching planning this way: subproblems which are robust bottlenecks across many different trajectories we thought of are more likely to be bottlenecks on the trajectories we didn't think of - including the trajectory followed by the real world. In other words, this sort of planning is likely to result in actions which still make sense in hindsight, especially in areas with lots of uncertainty, even after the world has thrown lots of surprises at us.
2. So what exactly are the "robust bottlenecks" you're targeting?
For the past few years, understanding natural abstraction has been the main focus. Roughly speaking, the questions are: what structures in an environment will a wide variety of adaptive systems trained/evolved in that environment convergently use as internal concepts? When and why will that happen, how can we measure those structures, how will they be represented in trained/evolved systems, how can we detect their use in trained/evolved systems, etc.?
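The "how can we detect their use in trained/evolved systems" question can be made a little more concrete with a toy measurement. The sketch below is my own illustration, not part of the natural abstraction work itself: it uses linear CKA (a standard representation-similarity metric from Kornblith et al., 2019) to ask whether two separately trained systems, shown the same inputs, end up with similar internal activations. The random linear maps here are hypothetical stand-ins for real hidden layers.

```python
# Minimal sketch: compare two systems' activations on the same inputs with linear CKA.
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_inputs, n_features)."""
    X = X - X.mean(axis=0)  # center each feature
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    inputs = rng.normal(size=(500, 10))
    # Two "models" looking at the same environment; in a real experiment these
    # would be hidden-layer activations of separately trained networks.
    acts_a = inputs @ rng.normal(size=(10, 32))
    acts_b = inputs @ rng.normal(size=(10, 32))
    print(f"CKA(A, B) = {linear_cka(acts_a, acts_b):.3f}")
```

In an actual experiment the activation matrices would come from independently trained (or evolved) systems, and high similarity on environment-derived features would be one weak signal that some structure in the environment is being convergently learned.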
3. How is understanding abstraction a bottleneck to any alignment approach at all?
Well, the point of a robust bottleneck is that it shows up along many different paths, so let's talk through a few very different paths (which will probably be salient to very different readers). Just to set expectations: I do not expect that I can jam enough detail into one post that every reader will find their particular cruxes addressed. Or even most readers. But hopefully it will become clear why this "understanding abstraction is a robust bottleneck to alignment" claim is a thing a sane person might come to believe.
How is abstraction a bottleneck to alignment via interpretability?
For concreteness, we'll talk about a "retargeting the search"-style approach to using interpretability for alignment, though I expect the discussion in this section to generalize. It's roughly the plan sketched at the start of this post: do interpretability real good, look through the AI's internal concepts/language to figure out a good alignment target which we can express in that language, then write that target (in the AI's internal concept-language) into the ...