Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: How DeepMind's Generally Capable Agents Were Trained, published by 1a3orn on the AI Alignment Forum.
Intro
One of DeepMind's latest papers, Open-Ended Learning Leads to Generally Capable Agents, explains how DeepMind produced agents that can successfully play games as complex as hide-and-seek or capture-the-flag without even having trained on or seen these games before.
As far as I know, this is an entirely unprecedented level of generality for a reinforcement-learning agent.
The following is a high-level summary of the paper, meant to be accessible to non-specialists, that should nevertheless produce something resembling a gears-level model. I want to focus on explaining the optimization process that produced this agent; on what the different parts of the optimization process are; on why each different part is necessary; and on what would happen if different parts of it were missing.
After that summary, I'll add a few more comments and questions about design choices within the paper and about future research I'd like to see. I'm far less certain about this second part, however.
I was going to include a part on AI timelines -- but whether this paper influences your timelines, and in what direction, depends on a lot of priors that are out of scope for what I want to do here.
The Environment
Before we get into the optimization process of the agent, I need to talk about the environment within which the agent trained. Core to the project of this paper are the millions of dynamically-generated tasks on which the agent can train.
Each task in the XLand Unity-powered 3D environment space is defined by (1) a unique physical world and (2) a unique set of goals / rewards. Throughout what follows I refer to (1) as the "environment" or "world", to (2) as the "game", and to both of them together as a "task." Note that both of these can be generated programmatically, without human intervention.
(The show reel of the trained agents operating on hand-made, held-out test tasks is worth watching for at least a few minutes to get a feel for the complexity possible from both world and goals, and is probably much clearer than my writing about the environment space. [I mean, if you want to understand soccer, watch a game of it, don't read a description.] Although you should note that the intelligibility of the goals in the video is uncharacteristic of the goals in the training tasks, because the show-reel goals were made by humans from human-intelligible games, rather than randomly generated.)
Anyhow. What kind of variety exists in this space?
Well, each world has a static landscape with dynamic, simulated rigid-body objects placed upon it. The topographical features of the world can be randomly mutated; the size of the world and lighting of the world can vary; the rigid-body cubes, pyramids, spheres, and slabs on the map can be randomly colored, sized, and placed.
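As a rough illustration of what "programmatically generated worlds" means, here is a minimal sketch of sampling a world's parameters. All names here (`sample_world`, the parameter ranges, the `heightmap_seed` stand-in for topography mutation) are hypothetical illustrations, not the paper's actual generator.

```python
import random

# Shapes and colors mentioned in the paper's description of XLand worlds.
SHAPES = ["cube", "pyramid", "sphere", "slab"]
COLORS = ["black", "purple", "yellow"]

def sample_world(rng: random.Random) -> dict:
    """Randomly sample a world: size, lighting, terrain, and rigid-body objects.

    The field names and value ranges are illustrative assumptions only.
    """
    n_objects = rng.randint(3, 12)
    return {
        "size": rng.choice(["small", "medium", "large"]),
        "lighting": rng.uniform(0.5, 1.5),       # hypothetical intensity scale
        "heightmap_seed": rng.getrandbits(32),   # stands in for topography mutation
        "objects": [
            {
                "shape": rng.choice(SHAPES),
                "color": rng.choice(COLORS),
                "scale": rng.uniform(0.5, 2.0),
                "position": (rng.uniform(0.0, 10.0), rng.uniform(0.0, 10.0)),
            }
            for _ in range(n_objects)
        ],
    }

world = sample_world(random.Random(0))
print(len(world["objects"]))
```

The point is only that every degree of freedom in the world (terrain, size, lighting, object shape/color/scale/placement) is a sampled parameter, so millions of distinct worlds can be drawn without any human in the loop.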
Each game, with its goals and rewards, can also be randomly generated. I'm not being redundant by talking about both goals and rewards: each agent both receives information specifying what would cause it to be rewarded (the goal) and receives a numeric reward of 0 or 1 at each timestep.
The fundamental atoms for the definition of goals are atomic predicates, such as being "on", "near", "far", or "holding" something. These atoms can be applied to different entities to form sub-goals, such as "the player is on the yellow floor" or "the black cube is near the black pyramid". A complete goal is then represented as a set of options (disjunctions) over some set(s) of necessary predicates (conjunctions) -- a complete goal might be "(Hold a purple sphere AND be near a yellow cube) OR (See an opponent AND be near a black cube)." Obviously such goals can be randomly generated, and obviously there are a very large number of them.
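This disjunction-of-conjunctions structure is just disjunctive normal form, and checking whether the 0/1 reward fires on a given timestep is easy to sketch. The class and function names below are my own illustrative assumptions, not the paper's code; only the predicate vocabulary ("hold", "near", "see", etc.) comes from the paper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Predicate:
    """One atomic predicate applied to entities, e.g. near(player, yellow cube)."""
    relation: str                 # e.g. "hold", "near", "on", "see"
    subject: str                  # e.g. "player", "black cube"
    obj: Optional[str] = None     # second entity, if the relation is binary

# A goal in disjunctive normal form: a list of options,
# where each option is a list of predicates that must all hold.
Goal = list

def reward(goal: Goal, true_predicates: set) -> int:
    """Return 1 if any option's predicates all hold this timestep, else 0."""
    return int(any(all(p in true_predicates for p in option) for option in goal))

# "(Hold a purple sphere AND be near a yellow cube) OR
#  (See an opponent AND be near a black cube)"
example_goal = [
    [Predicate("hold", "player", "purple sphere"),
     Predicate("near", "player", "yellow cube")],
    [Predicate("see", "player", "opponent"),
     Predicate("near", "player", "black cube")],
]

# Suppose at this timestep the simulator reports these predicates as true:
state = {
    Predicate("hold", "player", "purple sphere"),
    Predicate("near", "player", "yellow cube"),
}
print(reward(example_goal, state))  # -> 1: the first option is fully satisfied
```

Because goals are just random draws of predicates arranged into this DNF shape, the game side of a task can be generated as cheaply and automatically as the world side.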