Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: , published by on the AI Alignment Forum.
Highlights
OpenAI Five (Many people at OpenAI): OpenAI has trained a team of five neural networks to play a particular set of Dota heroes in a mirror match (both teams play the same set of heroes) with a few restrictions, and has started to beat amateur human players. They are aiming to beat a team of top professionals at The International in August, with the same set of five heroes, but without any other restrictions. Salient points:
The method is remarkably simple -- it's a scaled-up version of PPO with training data coming from self-play, plus reward shaping and some heuristics for exploration, where each agent is implemented as an LSTM.
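To make "a scaled-up version of PPO" concrete, here is a minimal sketch of the PPO clipped surrogate objective at the heart of that method. This is an illustration of the standard PPO loss, not OpenAI's actual implementation; the function name and toy inputs are my own.

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO clipped surrogate objective (to be maximized).

    ratio: pi_new(a|s) / pi_old(a|s) for each sampled action
    advantage: estimated advantage for each sampled action
    eps: clipping range, the standard PPO hyperparameter
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    # Take the pessimistic (lower) value per sample, then average the batch.
    return np.minimum(unclipped, clipped).mean()

# Toy check: a large policy ratio with positive advantage gets clipped,
# so a single update cannot move the policy arbitrarily far.
print(ppo_clip_loss(np.array([3.0]), np.array([1.0])))  # 1.2, not 3.0
```

The clipping is what lets PPO take many gradient steps on the same batch of self-play data without the policy collapsing, which matters at this scale.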
There's no human data apart from the reward shaping and exploration heuristics.
Contrary to most expectations, they didn't need anything fundamentally new to get long-term strategic planning. I was particularly surprised by this. There are some interesting thoughts from OpenAI researchers in this thread -- in particular, assuming good exploration, the variance of the gradient should scale linearly with the episode duration, so you might expect to need only linearly more samples to counteract this.
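The variance claim can be checked empirically in a toy setting: if each timestep contributes independent noise to the return, the variance of the summed return grows linearly with the horizon. This is a generic statistical illustration of the argument, not a model of Dota itself.

```python
import random

def return_variance(horizon, trials=10000, seed=0):
    """Empirical variance of a return that sums i.i.d. per-step noise."""
    rng = random.Random(seed)
    returns = [
        sum(rng.gauss(0, 1) for _ in range(horizon))
        for _ in range(trials)
    ]
    mean = sum(returns) / trials
    return sum((r - mean) ** 2 for r in returns) / trials

# Variance scales roughly linearly with horizon: ~10 and ~100 here.
for h in (10, 100):
    print(h, round(return_variance(h)))
```

Linear variance growth implies linearly more samples to keep the gradient estimate at a fixed accuracy, which is far more benign than the exponential blowup one might have feared for long-horizon credit assignment.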
They used 256 dedicated GPUs and 128,000 preemptible CPUs. A Hacker News comment estimates the cost at $2500 per hour, which would put the likely total cost in the millions of dollars.
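A quick back-of-the-envelope check of that estimate, assuming the $2500/hour figure and round-the-clock training (both are assumptions from the Hacker News comment, not official numbers):

```python
# Back-of-the-envelope cost of continuous training at the estimated rate.
hourly_cost = 2500           # dollars per hour (Hacker News estimate)
hours_per_month = 24 * 30    # assume round-the-clock training

one_month = hourly_cost * hours_per_month
print(f"${one_month:,} per month")  # $1,800,000 per month
```

So even a single month of training at this rate lands in the millions of dollars, consistent with the claim.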
They simulate 900 years of Dota every day, which is a ratio of ~330,000:1, suggesting that each CPU is running Dota ~2.6x faster than real time. In reality, it's probably running many times faster than that, but preemptions, communication costs, synchronization etc. all lead to inefficiency.
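The arithmetic behind those two figures, using the stated 900 years per day and 128,000 CPUs:

```python
# 900 years of Dota simulated per wall-clock day, on 128,000 CPUs.
years_per_day = 900
cpus = 128_000

ratio = years_per_day * 365       # days of game time per real day
per_cpu_speedup = ratio / cpus    # average speedup per CPU

print(round(ratio), round(per_cpu_speedup, 1))  # 328500 2.6
```

The ~330,000:1 ratio is a throughput average across the whole cluster, which is why the per-CPU figure of ~2.6x understates how fast a single unobstructed simulation actually runs.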
There was no explicit communication mechanism between agents, but each agent observes the full Dota 2 state (features, not pixels) that any agent on the team could observe, so communication is not really necessary.
A version of the code with a serious bug was still able to train to beat humans. Not encouraging for safety.
Alex Irpan covers some of these points in more depth in Quick Opinions on OpenAI Five.
Gwern comments as well.
My opinion: I might be more excited by an approach that was able to learn from human games (which are plentiful), and perhaps finetune with RL, in order to develop an approach that could generalize to more tasks in the future, where human data is available but a simulator is not. (Given the ridiculous sample complexity, pure RL with PPO can only be used in tasks with a simulator.) On the other hand, an approach that leveraged human data would necessarily be at least somewhat specific to Dota. A dependence on human data is unlikely to get us to general intelligence, whereas this result suggests that we can solve tasks that have a simulator, exploration strategy, and a dense reward function, which really is pushing the boundary on generality. This seems to be gdb's take: "We are very encouraged by the algorithmic implication of this result — in fact, it mirrors closely the story of deep learning (existing algorithms at large scale solve otherwise unsolvable problems). If you have a very hard problem for which you have a simulator, our results imply there is a real, practical path towards solving it. This still needs to be proven out in real-world domains, but it will be very interesting to see the full ramifications of this finding."
Paul's research agenda FAQ (zhukeepa): Exactly what it sounds like. I'm not going to summarize it because it's long and covers a lot of stuff, but I do recommend it.
Technical AI alignment
Technical agendas and prioritization
Conceptual issues in AI safety: the paradigmatic gap (Jon Gauthier): Lots of current work on AI safety focuses on what we can call "mid-term safety" -- the safety of AI systems that are more powerful and more broadly deployed than the ones we have t...