Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Gradient Descent on the Human Brain, published by Jozdien on April 2, 2024 on LessWrong.
TL;DR: Many alignment research proposals share a common motif: figure out how to enter a basin of alignment / corrigibility for human-level models, and then amplify to more powerful regimes while generalizing gracefully. In this post we lay out a research agenda that comes at this problem from a different direction: if we already have ~human-level systems with extremely robust generalization properties, we should just amplify those directly.
We'll call this strategy "Gradient Descent on the Human Brain".
Introduction
Put one way, the hard part of the alignment problem is figuring out how to solve ontology identification: mapping between an AI's model of the world and a human's model, in order to translate and specify human goals in an alien ontology.
In generality, in the worst case, this is a pretty difficult problem. But is solving this problem necessary to create safe superintelligences? You only need to solve it for arbitrary ontologies if the path to superintelligence necessarily routes through systems with ontologies different from ours. We don't need to solve ontology translation for high-bandwidth communication with other humans[1].
Thus far, we haven't said anything really novel. The central problem with this approach, as any alignment researcher would know, is that we don't really have a good way to bootstrap the human brain to superintelligent levels. There have been a few recent attempts to approach this, though they focus on very prosaic methods that, at best, buy points on the margin. Scaling to superintelligence requires much stronger and more robust methods of optimization.
The Setup
The basic setup is pretty simple, though there are a few nuances and extensions that are hopefully self-explanatory.
The simple version: Take a hundred human brains, put them in a large vat, and run gradient descent on the entire thing.
The human brain is a remarkably powerful artifact for its size, so finding a way to combine the capabilities of a hundred human brains with gradient descent should result in something significantly more powerful. As an intuition pump, think of how powerful human organizations are despite the much lower communication bandwidth between their members.
At the very least we should be able to surpass that bound; more impressive versions of this could look like an integrated single mind that combines the capabilities of all hundred brains.
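To make this concrete, here's a minimal toy sketch in PyTorch of what "run gradient descent on the entire thing" could look like, if we pretend each brain can be treated as a differentiable module. The `BrainStandIn` and `Vat` classes and the placeholder objective are purely illustrative, not a claim about the actual wet-lab interface:

```python
import torch
import torch.nn as nn

class BrainStandIn(nn.Module):
    """Stand-in for one human brain. Real brains are not (as far as we know)
    differentiable, so a small MLP plays the role here."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class Vat(nn.Module):
    """A hundred brains in a large vat, plus a crude learned integration layer."""
    def __init__(self, n_brains: int = 100, dim: int = 128):
        super().__init__()
        self.brains = nn.ModuleList([BrainStandIn(dim) for _ in range(n_brains)])
        self.mixer = nn.Linear(n_brains * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        outputs = [brain(x) for brain in self.brains]
        return self.mixer(torch.cat(outputs, dim=-1))

vat = Vat()
optimizer = torch.optim.SGD(vat.parameters(), lr=1e-3)  # literal gradient descent

# One update step against a placeholder environment signal (random tensors).
inputs, targets = torch.randn(32, 128), torch.randn(32, 128)
optimizer.zero_grad()
loss = nn.functional.mse_loss(vat(inputs), targets)
loss.backward()
optimizer.step()
```

The `mixer` layer is one crude way to force cross-brain integration; the training-signal options below vary the loss, not this basic structure.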
The specifics of what the training signal should be are, I think, a rather straightforward engineering problem. Some pretty off-the-cuff ideas, in increasing order of endorsement:
Train them for specific tasks, such as Pong or Doom. This risks loss of generality, however.
Train them to predict arbitrary input signals from the environment. The brain is pretty good at picking up on patterns in input streams, which this leverages to amplify latent capabilities. This addresses the lack of generality, but may not strongly incentivize cross-brain synergy.
Train them to predict each other. Human brains being the most general-purpose objects in existence, this should be a very richly general training channel, and it incentivizes brain-to-brain (B2B) interaction. This is similar in spirit to HCH; a toy sketch of this objective follows below.
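Here is a toy rendering of that last option, reusing the stand-in brains from the sketch above (the ring-shaped "predict your neighbor" structure and the function name are just illustrative choices):

```python
import torch
import torch.nn.functional as F

def mutual_prediction_loss(brains, x: torch.Tensor) -> torch.Tensor:
    """Each brain predicts the output of the next brain in a ring, so every
    brain is both predictor and prediction target. `brains` is any sequence
    of modules, e.g. `vat.brains` from the earlier sketch."""
    outputs = [brain(x) for brain in brains]
    loss = torch.zeros(())
    for i, out in enumerate(outputs):
        # Detach the neighbor's output so gradients only update the predictor.
        target = outputs[(i + 1) % len(outputs)].detach()
        loss = loss + F.mse_loss(out, target)
    return loss / len(outputs)
```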
A slightly more sophisticated setup:
Aside: Whose brains should we use for this?
The comparative advantage of this agenda lies in the strong generalization properties inherent to the human brain[2]. However, to further push the frontier of safety and allow for a broad basin of graceful failure, we think that the brains used should have a strong understanding of the alignment literature. We're planning on running a prototype with a few volunteer researchers - if you want to help, please reach out!
Potential Directions
More sophisticate...