Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: , published by on the AI Alignment Forum.
Highlights
OpenAI Five (Many people at OpenAI): OpenAI has trained a team of five neural networks to play a particular set of Dota heroes in a mirror match (both teams play the same set of heroes) with a few restrictions, and has started to beat amateur human players. They are aiming to beat a team of top professionals at The International in August, with the same set of five heroes, but without any other restrictions. Salient points:
The method is remarkably simple -- it's a scaled-up version of PPO with training data coming from self-play, plus reward shaping and some heuristics for exploration, where each agent is implemented as an LSTM.
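To make "a scaled-up version of PPO" concrete, here is a minimal sketch of the PPO clipped surrogate objective at the heart of that method. This is an illustration of the standard PPO loss, not OpenAI's actual implementation; the function name and toy inputs are my own.

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO clipped surrogate objective (to be maximized).

    ratio: pi_new(a|s) / pi_old(a|s) for each sampled action
    advantage: estimated advantage for each sampled action
    eps: clipping range, the standard PPO hyperparameter
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    # Take the pessimistic (lower) value per sample, then average the batch.
    return np.minimum(unclipped, clipped).mean()

# Toy check: a large policy ratio with positive advantage gets clipped,
# so a single update cannot move the policy arbitrarily far.
print(ppo_clip_loss(np.array([3.0]), np.array([1.0])))  # 1.2, not 3.0
```

The clipping is what lets PPO take many gradient steps on the same batch of self-play data without the policy collapsing, which matters at this scale.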
There's no human data apart from the reward shaping and exploration heuristics.
Contrary to most expectations, they didn't need anything fundamentally new to get long-term strategic planning. I was particularly surprised by this. There are some interesting thoughts from OpenAI researchers in this thread -- in particular, assuming good exploration, the variance of the gradient should scale linearly with the episode duration, so you might expect to need only linearly more samples to counteract this.
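The variance claim can be checked empirically in a toy setting: if each timestep contributes independent noise to the return, the variance of the summed return grows linearly with the horizon. This is a generic statistical illustration of the argument, not a model of Dota itself.

```python
import random

def return_variance(horizon, trials=10000, seed=0):
    """Empirical variance of a return that sums i.i.d. per-step noise."""
    rng = random.Random(seed)
    returns = [
        sum(rng.gauss(0, 1) for _ in range(horizon))
        for _ in range(trials)
    ]
    mean = sum(returns) / trials
    return sum((r - mean) ** 2 for r in returns) / trials

# Variance scales roughly linearly with horizon: ~10 and ~100 here.
for h in (10, 100):
    print(h, round(return_variance(h)))
```

Linear variance growth implies linearly more samples to keep the gradient estimate at a fixed accuracy, which is far more benign than the exponential blowup one might have feared for long-horizon credit assignment.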
They used 256 dedicated GPUs and 128,000 preemptible CPUs. A Hacker News comment estimates the cost at $2500 per hour, which would put the likely total cost in the millions of dollars.
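A quick back-of-the-envelope check of that estimate, assuming the $2500/hour figure and round-the-clock training (both are assumptions from the Hacker News comment, not official numbers):

```python
# Back-of-the-envelope cost of continuous training at the estimated rate.
hourly_cost = 2500           # dollars per hour (Hacker News estimate)
hours_per_month = 24 * 30    # assume round-the-clock training

one_month = hourly_cost * hours_per_month
print(f"${one_month:,} per month")  # $1,800,000 per month
```

So even a single month of training at this rate lands in the millions of dollars, consistent with the claim.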
They simulate 900 years of Dota every day, which is a ratio of ~330,000:1, suggesting that each CPU is running Dota ~2.6x faster than real time. In reality, it's probably running many times faster than that, but preemptions, communication costs, synchronization etc. all lead to inefficiency.
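The arithmetic behind those two figures, using the stated 900 years per day and 128,000 CPUs:

```python
# 900 years of Dota simulated per wall-clock day, on 128,000 CPUs.
years_per_day = 900
cpus = 128_000

ratio = years_per_day * 365       # days of game time per real day
per_cpu_speedup = ratio / cpus    # average speedup per CPU

print(round(ratio), round(per_cpu_speedup, 1))  # 328500 2.6
```

The ~330,000:1 ratio is a throughput average across the whole cluster, which is why the per-CPU figure of ~2.6x understates how fast a single unobstructed simulation actually runs.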
There was no explicit communication mechanism between agents, but each agent observes the full Dota 2 state (features, not pixels) that any agent on the team could observe, so communication is not really necessary.
A version of the code with a serious bug was still able to train to beat humans. Not encouraging for safety.
Alex Irpan covers some of these points in more depth in Quick Opinions on OpenAI Five.
Gwern comments as well.
My opinion: I might be more excited by an approach that was able to learn from human games (which are plentiful), and perhaps finetune with RL, in order to develop an approach that could generalize to more tasks in the future, where human data is available but a simulator is not. (Given the ridiculous sample complexity, pure RL with PPO can only be used in tasks with a simulator.) On the other hand, an approach that leveraged human data would necessarily be at least somewhat specific to Dota. A dependence on human data is unlikely to get us to general intelligence, whereas this result suggests that we can solve tasks that have a simulator, exploration strategy, and a dense reward function, which really is pushing the boundary on generality. This seems to be gdb's take: "We are very encouraged by the algorithmic implication of this result — in fact, it mirrors closely the story of deep learning (existing algorithms at large scale solve otherwise unsolvable problems). If you have a very hard problem for which you have a simulator, our results imply there is a real, practical path towards solving it. This still needs to be proven out in real-world domains, but it will be very interesting to see the full ramifications of this finding."
Paul's research agenda FAQ (zhukeepa): Exactly what it sounds like. I'm not going to summarize it because it's long and covers a lot of stuff, but I do recommend it.
Technical AI alignment
Technical agendas and prioritization
Conceptual issues in AI safety: the paradigmatic gap (Jon Gauthier): Lots of current work on AI safety focuses on what we can call "mid-term safety" -- the safety of AI systems that are more powerful and more broadly deployed than the ones we have t...