Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: How DeepMind's Generally Capable Agents Were Trained, published by 1a3orn on the AI Alignment Forum.
Intro
One of DeepMind's latest papers, Open-Ended Learning Leads to Generally Capable Agents, explains how DeepMind produced agents that can successfully play games as complex as hide-and-seek or capture-the-flag without even having trained on or seen these games before.
As far as I know, this is an entirely unprecedented level of generality for a reinforcement-learning agent.
The following is a high-level summary of the paper, meant to be accessible to non-specialists, that should nevertheless produce something resembling a gears-level model. I want to focus on explaining the optimization process that produced this agent; on what the different parts of the optimization process are; on why each different part is necessary; and on what would happen if different parts of it were missing.
After that summary, I'll add a few more comments and questions about design choices within the paper and about future research I'd like to see. I'm far less certain about this second part, however.
I was going to include a part on AI timelines -- but whether this paper influences your timelines, and in what direction, depends on a lot of priors that are out of scope for what I want to do here.
The Environment
Before we get into the optimization process of the agent, I need to talk about the environment within which the agent trained. Core to the project of this paper are the millions of dynamically-generated tasks on which the agent can train.
Each task in the XLand Unity-powered 3D environment space is defined by (1) a unique physical world and (2) a unique set of goals / rewards. Throughout what follows I refer to (1) as the "environment" or "world", to (2) as the "game", and to both of them together as a "task." Note that both of these can be generated programmatically, without human intervention.
(The show reel of the trained agents operating on hand-made, held-out test tasks is worth watching for at least a few minutes to get a feel for the complexity possible from both world and goals, and is probably much clearer than my writing about the environment space. [I mean, if you want to understand soccer, watch a game of it, don't read a description.] Although you should note that the intelligibility of the goals in the video is uncharacteristic of the goals in the training tasks, because the show-reel goals were made by humans from human-intelligible games, rather than randomly generated.)
Anyhow. What kind of variety exists in this space?
Well, each world has a static landscape with dynamic, simulated rigid-body objects placed upon it. The topographical features of the world can be randomly mutated; the size of the world and lighting of the world can vary; the rigid-body cubes, pyramids, spheres, and slabs on the map can be randomly colored, sized, and placed.
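As a rough illustration of what "programmatically generated worlds" means, here is a minimal sketch of sampling a world's parameters. All names here (`sample_world`, the parameter ranges, the `heightmap_seed` stand-in for topography mutation) are hypothetical illustrations, not the paper's actual generator.

```python
import random

# Shapes and colors mentioned in the paper's description of XLand worlds.
SHAPES = ["cube", "pyramid", "sphere", "slab"]
COLORS = ["black", "purple", "yellow"]

def sample_world(rng: random.Random) -> dict:
    """Randomly sample a world: size, lighting, terrain, and rigid-body objects.

    The field names and value ranges are illustrative assumptions only.
    """
    n_objects = rng.randint(3, 12)
    return {
        "size": rng.choice(["small", "medium", "large"]),
        "lighting": rng.uniform(0.5, 1.5),       # hypothetical intensity scale
        "heightmap_seed": rng.getrandbits(32),   # stands in for topography mutation
        "objects": [
            {
                "shape": rng.choice(SHAPES),
                "color": rng.choice(COLORS),
                "scale": rng.uniform(0.5, 2.0),
                "position": (rng.uniform(0.0, 10.0), rng.uniform(0.0, 10.0)),
            }
            for _ in range(n_objects)
        ],
    }

world = sample_world(random.Random(0))
print(len(world["objects"]))
```

The point is only that every degree of freedom in the world (terrain, size, lighting, object shape/color/scale/placement) is a sampled parameter, so millions of distinct worlds can be drawn without any human in the loop.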
Each game, with its goals and rewards, can also be randomly generated. I'm not being redundant by talking about both goals and rewards: each agent both receives information specifying what would cause it to be rewarded (the goal) and receives a numeric reward of 0 or 1 at each timestep.
The fundamental atoms for the definition of goals are atomic predicates, such as being "on", "near", "far", or "holding" something. These atoms can be applied to different entities to form sub-goals, such as "the player is on the yellow floor" or "the black cube is near the black pyramid". A complete goal is then represented as a set of options (disjunctions) over some set(s) of necessary predicates (conjunctions) -- a complete goal might be "(Hold a purple sphere AND be near a yellow cube) OR (See an opponent AND be near a black cube)." Obviously such goals can be randomly generated, and obviously there are a very large number of them.
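This disjunction-of-conjunctions structure is just disjunctive normal form, and checking whether the 0/1 reward fires on a given timestep is easy to sketch. The class and function names below are my own illustrative assumptions, not the paper's code; only the predicate vocabulary ("hold", "near", "see", etc.) comes from the paper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Predicate:
    """One atomic predicate applied to entities, e.g. near(player, yellow cube)."""
    relation: str                 # e.g. "hold", "near", "on", "see"
    subject: str                  # e.g. "player", "black cube"
    obj: Optional[str] = None     # second entity, if the relation is binary

# A goal in disjunctive normal form: a list of options,
# where each option is a list of predicates that must all hold.
Goal = list

def reward(goal: Goal, true_predicates: set) -> int:
    """Return 1 if any option's predicates all hold this timestep, else 0."""
    return int(any(all(p in true_predicates for p in option) for option in goal))

# "(Hold a purple sphere AND be near a yellow cube) OR
#  (See an opponent AND be near a black cube)"
example_goal = [
    [Predicate("hold", "player", "purple sphere"),
     Predicate("near", "player", "yellow cube")],
    [Predicate("see", "player", "opponent"),
     Predicate("near", "player", "black cube")],
]

# Suppose at this timestep the simulator reports these predicates as true:
state = {
    Predicate("hold", "player", "purple sphere"),
    Predicate("near", "player", "yellow cube"),
}
print(reward(example_goal, state))  # -> 1: the first option is fully satisfied
```

Because goals are just random draws of predicates arranged into this DNF shape, the game side of a task can be generated as cheaply and automatically as the world side.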