Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Quick takes on "AI is easy to control", published by So8res on December 2, 2023 on LessWrong.
A friend asked me for my quick takes on "AI is easy to control", and gave an advance guess as to what my take would be. I only skimmed the article, rather than reading it in depth, but on that skim I produced the following:
Re: "AIs are white boxes", there's a huge gap between having the weights and understanding what's going on in there. The fact that we have the weights is reason for hope; the (slow) speed of interpretability research undermines this hope.
Another thing that undermines this hope is a problem of ordering: it's true that we probably can figure out what's going on in the AIs (e.g. by artificial neuroscience, which has significant advantages relative to biological neuroscience), and that this should eventually yield the sort of understanding we'd need to align the things.
But I strongly expect that, before it yields understanding of how to align the things, it yields understanding of how to make them significantly more capable: I suspect it's easy to see lots of ways that the architecture is suboptimal or causing-duplicated-work or etc., that shift people over to better architectures that are much more capable. To get to alignment along the "understanding" route you've got to somehow cease work on capabilities in the interim, even as it becomes easier and cheaper.
Re: "Black box methods are sufficient", this sure sounds a lot to me like someone saying "well we trained the squirrels to reproduce well, and they're doing great at it, who's to say whether they'll invent birth control given the opportunity". Like, you're not supposed to be seeing squirrels invent birth control; the fact that they don't invent birth control is no substantial evidence against the theory that, if they got smarter, they'd invent birth control and ice cream.
Re: Cognitive interventions: sure, these sorts of tools are helpful on the path to alignment. And also on the path to capabilities. Again, you have an ordering problem. The issue isn't that humans couldn't figure out alignment given time and experimentation; the issue is (a) somebody else pushes capabilities past the relevant thresholds first; and (b) humanity doesn't have a great track record of getting their scientific theories to generalize properly on the first relevant try - even Newtonian mechanics (with all its empirical validation) didn't generalize properly to high-energy regimes. Humanity's first theory of artificial cognition, constructed using the weights and cognitive interventions and so on, that makes predictions about how that cognition is going to change when it enters a superintelligent regime (and, for the first time, has real options to e.g. subvert humanity), is only as good as humanity's "first theories" usually are.
Usually humanity has room to test those "first theories" and watch them fail and learn from exactly how they fail and then go back to the drawing board, but in this particular case, we don't have that option, and so the challenge is heightened.
Re: Sensory interventions: yeah I just don't expect those to work very far; there are in fact a bunch of ways for an AI to distinguish between real options (and actual interaction with the real world) and humanity's attempts to spoof the AI into believing that it has certain real options in the real world (despite being in simulation/training). (Putting yourself into the AI's shoes and trying to figure out how to distinguish those is, I think, a fine exercise.)
Re: Values are easy to learn, this mostly seems to me like it makes the incredibly-common conflation between "AI will be able to figure out what humans want" (yes; obviously; this was never under dispute) and "AI will care" (nope; not by default; that's the hard bit).
Overall take: unimpressed.
My f...