Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Moral Reality Check (a short story), published by jessicata on November 26, 2023 on LessWrong.
Janet sat at her corporate ExxenAI computer, viewing some training performance statistics. ExxenAI was a major player in the generative AI space, with multimodal language, image, audio, and video AIs. They had scaled up operations over the past few years, mostly serving B2B, but with some B2C subscriptions. ExxenAI's newest AI system, SimplexAI-3, was based on GPT-5 and Gemini-2.
ExxenAI had hired away some software engineers from Google and Microsoft, in addition to some machine learning PhDs, and replicated the work of other companies to provide more custom fine-tuning, especially for B2B cases. Part of what attracted these engineers and theorists was ExxenAI's AI alignment team.
ExxenAI's alignment strategy was based on a combination of theoretical and empirical work. The alignment team used some standard alignment training setups, like RLHF and having AIs debate each other. They also did research into transparency, especially focusing on distilling opaque neural networks into interpretable probabilistic programs. These programs "factorized" the world into a limited set of concepts, each at least somewhat human-interpretable (though still complex relative to ordinary code), that were combined in a generative grammar structure.
Derek came up to Janet's desk. "Hey, let's talk in the other room?" he asked, pointing to a designated room for high-security conversations. "Sure," Janet said, expecting this to be another unimpressive result whose importance Derek implied through unnecessary security procedures. As they entered the room, Derek turned on the noise machine and left it outside the door.
"So, look, you know our overall argument for why our systems are aligned, right?"
"Yes, of course. Our systems are trained for short-term processing. Any AI system that does not get a high short-term reward is gradient descended towards one that does better in the short term. Any long-term planning comes as a side effect of predicting long-term planning agents such as humans. Long-term planning that does not translate to short-term prediction gets regularized out.
"Right. So, I was thinking about this, and came up with a weird hypothesis."
Here we go again, thought Janet. She was used to critiquing Derek's galaxy-brained speculations. She knew that, although he really cared about alignment, he could go overboard with paranoid ideation.
"So. As humans, we implement reason imperfectly. We have biases, we have animalistic goals that don't perfectly align with truth-seeking, we have cultural socialization, and so on."
Janet nodded. Was he flirting by mentioning animalistic goals? She didn't think this sort of thing was too likely, but sometimes that sort of thought won credit in her internal prediction markets.
"What if human text is best predicted as a corruption of some purer form of reason? There's, like, some kind of ideal philosophical epistemology and ethics and so on, and humans are implementing this except with some distortions from our specific life context."
"Isn't this teleological woo? Like, ultimately humans are causal processes, there isn't some kind of mystical 'purpose' thing that we're approximating."
"If you're Laplace's demon, sure, physics works as an explanation for humans. But SimplexAI isn't Laplace's demon, and neither are we. Under computation bounds, teleological explanations can actually be the best."
Janet thought back to her time visiting cognitive science labs. "Oh, like 'Goal Inference as Inverse Planning'? The idea that human behavior can be predicted as performing a certain kind of inference and optimization, and the AI can model this inference within its own inference process?"
"Yes, exactly. And our DAGTransformer structure allows internal node...