Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Pointer Resolution Problem, published by Jozdien on February 17, 2024 on LessWrong.
Imagine that you meet an 18th century altruist. They tell you "So, I've been thinking about whether or not to eat meat. Do you know whether animals have souls?" How would you answer, assuming you actually do want to be helpful?
One option is to spend a lot of time explaining why "soul" isn't actually the thing in the territory they care about, and talk about moral patienthood and theories of welfare and moral status. If they haven't walked away from you in the first thirty seconds this may even work, though I wouldn't bet on it.
Another option is to just say "yes" or "no", to try and answer what their question was pointing at. If they ask further questions, you can either dig in deeper and keep translating your real answers into their ontology or at some point try to retarget their questions' pointers toward concepts that do exist in the territory.
Low-fidelity pointers
The problem you're facing in the above situation is that the person you're talking to is using an inaccurate ontology to understand reality. The things they actually care about correspond to quite different objects in the territory. Those objects currently don't have very good pointers in their map. Trying to directly redirect their questions without first covering a fair amount of context and inferential distance over what these objects are probably wouldn't work very well.
So, the reason this is relevant to alignment:
Representations of things within the environment are learned by systems up to the level of fidelity that's required for the learning objective. This is true even if you assume a weak version of the natural abstraction hypothesis to be true; the general point isn't that there wouldn't be concepts corresponding to what we care about, but that they could be very fuzzy.
For example, let's say that you try to retarget an internal general-purpose search process. That post describes the following approach:
1. Identify the AI's internal concept corresponding to whatever alignment target we want to use (e.g. values/corrigibility/user intention/human mimicry/etc).
2. Identify the retargetable internal search process.
3. Retarget (i.e. directly rewire/set the input state of) the internal search process on the internal representation of our alignment target.
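As a rough illustration of the three steps above, here is a deliberately toy sketch. Everything in it is an assumption made for illustration: `ToyAgent`, `retarget`, the concept names, and the idea that concepts are neatly stored as named scoring functions are all hypothetical and do not come from the post. Real models would not expose anything this clean, which is exactly the difficulty discussed next.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical toy setup: states are plain integers.
State = int


@dataclass
class ToyAgent:
    # Internal "concepts": each maps a state to a score (higher is a better match).
    concepts: Dict[str, Callable[[State], float]]
    # Name of the concept the search process currently optimizes for.
    search_target: str

    def search(self, candidates: List[State]) -> State:
        """General-purpose search: pick the candidate that scores highest on the current target."""
        score = self.concepts[self.search_target]
        return max(candidates, key=score)


def retarget(agent: ToyAgent, concept_name: str) -> None:
    """Find the named concept (step 1) and set it as the search input (step 3).

    Step 2, locating the search process, is trivial here because the toy
    agent exposes it directly as `search_target`.
    """
    if concept_name not in agent.concepts:
        raise KeyError(f"no internal concept named {concept_name!r}")
    agent.search_target = concept_name


agent = ToyAgent(
    concepts={
        "learned_goal": lambda s: -abs(s - 9),      # prefers states near 9
        "alignment_target": lambda s: -abs(s - 3),  # prefers states near 3
    },
    search_target="learned_goal",
)

candidates = list(range(10))
before = agent.search(candidates)
retarget(agent, "alignment_target")
after = agent.search(candidates)
print(before, after)
```

In this sketch the same search procedure ends up optimizing a different concept after retargeting, with no retraining. The toy hides all of the hard parts: finding the concept, finding the search process, and the concept being crisp enough to point at.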
There are - very broadly, abstracting over a fair amount of nuance - three problems with this:
1. You need to have interpretability tools that are able to robustly identify human-relevant alignment properties from the AI's internals[1]. This isn't so much a problem with the approach as it is the hard thing you have to solve for it to work.
2. It doesn't seem obvious that existentially dangerous models are going to look like they're doing fully-retargetable search. Learning some heuristics that are specialized to the environment, task, or target is likely to make your search much more efficient[2]. These can be selectively learned and used for different contexts. This imposes a cost on arbitrary retargeting, because you have to relearn those heuristics for the new target.
3. The concept corresponding to the alignment target you want is not very well-specified. Retargeting your model to this concept probably would make it do the right things for a while. However, as the model starts to learn more abstractions relating to this new target, you run into an under-specification problem where the pointer can generalize in one of several ways.
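The under-specification in the last of these problems can be made concrete with a small toy example. This is only a sketch under invented assumptions: the data, the two threshold rules, and the names `refinement_a` and `refinement_b` are hypothetical stand-ins for "ways a fuzzy pointer could sharpen", not anything from the post.

```python
from typing import Callable, List, Tuple

# Hypothetical observations the fuzzy concept was learned from:
# (input, whether the concept applies).
observed: List[Tuple[int, bool]] = [(1, True), (2, True), (8, False), (9, False)]


def refinement_a(x: int) -> bool:
    """One way the fuzzy concept could sharpen: boundary at 5."""
    return x < 5


def refinement_b(x: int) -> bool:
    """Another way it could sharpen: boundary at 7."""
    return x < 7


def consistent(h: Callable[[int], bool], data: List[Tuple[int, bool]]) -> bool:
    """True if hypothesis h agrees with every observed example."""
    return all(h(x) == label for x, label in data)


# Both refinements fit everything seen so far...
both_fit = consistent(refinement_a, observed) and consistent(refinement_b, observed)

# ...but they come apart on an input outside the observed range of cases.
novel_input = 6
disagree = refinement_a(novel_input) != refinement_b(novel_input)
print(both_fit, disagree)
```

Nothing in the observed data distinguishes the two refinements, so a pointer that was only ever specified to this level of fidelity does not determine which one the system ends up with.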
The first problem (or at least, some version of it) seems unavoidable to me in any solution to alignment. What you want in the end is to interact with the things in your system that would assure you of its safety. There may be simpler ways to go about it, however. The second problem is somewhat related to the third, and would I think be solve...