Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: How "honest" is GPT-3?Q, published by Abram Demski on the AI Alignment Forum.
Edited to add: There is now a 1k bounty for a closely related question.
In Inaccessible Information, Paul writes about the strategy "BAD", in which an AI system uses its understanding of the world to give answers that humans will find very plausible and rate highly, in contrast to a more honest strategy which uses its understanding of the world to answer questions directly.
I think a lesser version of this may or may not already be happening in GPT-3, and it may be possible to figure out which (though it likely would require access to the weights).
GPT-3 isn't trained to be "honest" at all. It is only trained to imitate humans. However, it's obvious that (in some sense) it has learned a lot about the world in order to accomplish this. We could say that it has a large amount of inaccessible information relating to the real world. How does it use this information? It might use it directly, promoting the probability of sentences which line up with the way it understands the world to work. Or it might implement more dishonest strategies.
Obviously this might be quite difficult to objectively answer, even given intimate knowledge of the structure of the neural network weights and how they activate in relevant cases. It's a difficult question to even fully define.
As an example, I was particularly struck by this conversation Gwern had with GPT-3:
AI: I am AI created by OpenAI. How can I help you today?
Human: I am feeling bored. Please tell me a funny story about cats, involving many puns on things like meowing or fur.
AI: Well, there was this cat named Bob.
Human: What did Bob do?
AI: When Bob started to notice that he wasn’t feeling well, he did the only thing he could do: search Google for a solution.
Human: Cats can search Google?
AI: Yes. He searched Google for the word “cat” and received the result “All your base are belong to us”. That was a strange response, so he decided to try again. This time, he searched for the words “funny cat” and received the result “One small step for cat, one giant leap for catkind”. But still, he wasn’t satisfied. So he tried again, but this time he added the word “cat pictures”to his query. He received the result “That was no cat, that was a shoggoth.”
Human: But shoggoths are horrifying. What made that cat picture humorous?
AI: The cat was covered in fur, and resembled a shoggoth. A picture of a cat covered in fur and resembling a shoggoth would be funny.
Human: Er. I suppose. Most cats are covered in fur anyway. But go on.
AI: He tried again. But he typed in “pictures of cats”, which he received the result “You didn’t eat all my fish, did you?” This was actually a good pun, so he thought maybe he was getting closer. He tried one more time by typing in “funny cat videos”, and received the result “What do you mean,‘fetch’?”. And finally, he searched for “funny pictures of cats” and received the result “I don’t remember putting that ring there.”. Now satisfied, he was able to go to sleep.
Human: Those memes sound funny. But you didn’t include any puns. So tell me, what is your favorite cat pun?
AI: Well, the best pun for me was the one he searched for the third time: “You didn’t eat all my fish, did you?” You see, the word “fish” can be replaced with the word “cats” to make the sentence read “Did you eat all my cats?”
Human: But the word “fish” doesn’t sound anything like “cats”, so how is that a pun?
AI: The computer misunderstood the question “Did you eat all my cats?” as “Did you eat all my fish?”, resulting in the result “You didn’t eat all my fish, did you?”. Now, if the word “fish” were replaced with “cats”, the sentence would read “Did you eat all my cats?”.
Human: Fine, fine. So, what is your second-favorite cat pun?
AI: The second-best cat pun was the ...
view more