Download - LW - The Pointer Resolution Problem by Jozdien

The Nonlinear Library: LessWrong

LW - The Pointer Resolution Problem by Jozdien

2024-02-17

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Pointer Resolution Problem, published by Jozdien on February 17, 2024 on LessWrong.
Imagine that you meet an 18th-century altruist. They tell you "So, I've been thinking about whether or not to eat meat. Do you know whether animals have souls?" How would you answer, assuming you actually do want to be helpful?
One option is to spend a lot of time explaining why "soul" isn't actually the thing in the territory they care about, and talk about moral patienthood and theories of welfare and moral status. If they haven't walked away from you in the first thirty seconds this may even work, though I wouldn't bet on it.
Another option is to just say "yes" or "no", to try and answer what their question was pointing at. If they ask further questions, you can either dig in deeper and keep translating your real answers into their ontology or at some point try to retarget their questions' pointers toward concepts that do exist in the territory.
Low-fidelity pointers
The problem you're facing in the above situation is that the person you're talking to is using an inaccurate ontology to understand reality. The things they actually care about correspond to quite different objects in the territory. Those objects currently don't have very good pointers in their map. Trying to directly redirect their questions without first covering a fair amount of context and inferential distance over what these objects are probably wouldn't work very well.
So, the reason this is relevant to alignment:
Representations of things within the environment are learned by systems up to the level of fidelity that's required for the learning objective. This is true even if you assume a weak version of the natural abstraction hypothesis to be true; the general point isn't that there wouldn't be concepts corresponding to what we care about, but that they could be very fuzzy.
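As a toy illustration of this point (not something from the original post), the sketch below finds the coarsest representation that still solves a task perfectly. The function names `bin_of` and `coarsest_sufficient_bins`, the equal-width binning scheme, and the data are all hypothetical choices made for the example.

```python
def bin_of(value, n_bins):
    """Map a value in [0, 1) to one of n_bins equal-width bins."""
    return min(int(value * n_bins), n_bins - 1)


def coarsest_sufficient_bins(values, labels, max_bins=16):
    """Return the smallest bin count at which no bin mixes two labels.

    This stands in for "fidelity required by the learning objective":
    any finer distinction is unnecessary for the task.
    """
    for n_bins in range(1, max_bins + 1):
        label_for_bin = {}
        consistent = True
        for value, label in zip(values, labels):
            b = bin_of(value, n_bins)
            # setdefault records the first label seen in this bin.
            if label_for_bin.setdefault(b, label) != label:
                consistent = False
                break
        if consistent:
            return n_bins
    return None


# Hypothetical data: the task only asks "is the value below or above 0.5?"
values = [0.05, 0.2, 0.3, 0.6, 0.7, 0.95]
labels = [0, 0, 0, 1, 1, 1]

n = coarsest_sufficient_bins(values, labels)
same_bin = bin_of(0.2, n) == bin_of(0.3, n)
```

Here two bins are enough for the objective, so 0.2 and 0.3 end up indistinguishable in the learned representation. A concept that depends on telling them apart exists only fuzzily in this map.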
For example, let's say that you try to retarget an internal general-purpose search process. That post describes the following approach:
1. Identify the AI's internal concept corresponding to whatever alignment target we want to use (e.g. values/corrigibility/user intention/human mimicry/etc).
2. Identify the retargetable internal search process.
3. Retarget (i.e. directly rewire/set the input state of) the internal search process on the internal representation of our alignment target.
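The steps above can be sketched in a deliberately simplified form. This is only a toy model under strong assumptions: `ToyAgent`, `retarget`, and the concept names are hypothetical, and a real system would not expose its concepts or its search as a clean lookup table.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable

# A "concept" here is just a scoring function over candidate outcomes.
Concept = Callable[[int], float]


@dataclass
class ToyAgent:
    concepts: Dict[str, Concept]  # step 1: internal concepts we can identify
    target: str                   # which concept the search currently optimizes

    def search(self, candidates: Iterable[int]) -> int:
        # Step 2: a general-purpose search that maximizes the current target.
        score = self.concepts[self.target]
        return max(candidates, key=score)


def retarget(agent: ToyAgent, concept_name: str) -> None:
    # Step 3: rewire the search's input to the chosen internal concept.
    if concept_name not in agent.concepts:
        raise KeyError(f"no internal concept named {concept_name!r}")
    agent.target = concept_name


agent = ToyAgent(
    concepts={
        "proxy_reward": lambda x: x,                 # prefers larger outcomes
        "alignment_target": lambda x: -abs(x - 3),   # prefers outcomes near 3
    },
    target="proxy_reward",
)

before = agent.search(range(10))
retarget(agent, "alignment_target")
after = agent.search(range(10))
```

In this toy, the same search procedure picks 9 before retargeting and 3 afterwards. All of the difficulty in practice is hidden inside the assumption that the concept table and the search are cleanly separable and legible.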
There are - very broadly, abstracting over a fair amount of nuance - three problems with this:
1. You need to have interpretability tools that are able to robustly identify human-relevant alignment properties from the AI's internals[1]. This isn't so much a problem with the approach as it is the hard thing you have to solve for it to work.
2. It doesn't seem obvious that existentially dangerous models are going to look like they're doing fully-retargetable search. Learning some heuristics that are specialized to the environment, task, or target is likely to make your search much more efficient[2]. These can be selectively learned and used for different contexts. This imposes a cost on arbitrary retargeting, because you have to relearn those heuristics for the new target.
3. The concept corresponding to the alignment target you want is not very well-specified. Retargeting your model to this concept probably would make it do the right things for a while. However, as the model starts to learn more abstractions relating to this new target, you run into an under-specification problem where the pointer can generalize in one of several ways.
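The under-specification in the third problem can be made concrete with a small sketch. Everything here is hypothetical: `refinement_a` and `refinement_b` stand in for two ways a fuzzy pointer might sharpen, and the numbers are arbitrary.

```python
# Cases the model has encountered so far.
seen_cases = [0, 1, 2, 3]


def refinement_a(x):
    # One way the fuzzy concept could sharpen: "anything at or above 2".
    return x >= 2


def refinement_b(x):
    # Another way: "anything between 2 and 5 inclusive".
    return 2 <= x <= 5


# Both refinements are indistinguishable on everything seen so far...
agree_on_seen = all(refinement_a(x) == refinement_b(x) for x in seen_cases)

# ...but they come apart on novel cases the pointer was never pinned down on.
novel_cases = [4, 6, 10]
disagreements = [x for x in novel_cases if refinement_a(x) != refinement_b(x)]
```

Nothing in the seen cases favors one refinement over the other, so which way the pointer generalizes is decided by something other than the data that originally defined it.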
The first problem (or at least, some version of it) seems unavoidable to me in any solution to alignment. What you want in the end is to interact with the things in your system that would assure you of its safety. There may be simpler ways to go about it, however. The second problem is somewhat related to the third, and would I think be solve...