(As usual, this post was written by Nate Soares with some help and editing from Rob Bensinger.)
In my last post, I described a “hard bit” of the challenge of aligning AGI—the sharp left turn that comes when your system slides into the “AGI” capabilities well, the fact that alignment doesn’t generalize similarly well at this turn, and the fact that this turn seems likely to break a bunch of your existing alignment properties.
Here, I want to briefly discuss a variety of current research proposals in the field and explain why I think this problem is currently neglected.
I also want to mention research proposals that do strike me as having some promise, or that strike me as adjacent to promising approaches.
Before getting into that, let me be very explicit about three points:
- On my model, solutions to the problem of capabilities generalizing further than alignment are necessary but not sufficient. There is dignity in attacking a variety of other real problems, and I endorse that practice.
- The imaginary versions of people in the dialogs below are not the same as the people themselves. I'm probably misunderstanding the various proposals in important ways, and/or rounding them to stupider versions of themselves along some important dimensions.[1] If I've misrepresented your view, I apologize.
- I do not subscribe to the Copenhagen interpretation of ethics wherein someone who takes a bad swing at the problem (or takes a swing at a different problem) is more culpable for civilization's failure than someone who never takes a swing at all. Everyone whose plans I discuss below is highly commendable, laudable, and virtuous by my accounting.