Download - LW - OpenAI's Sora is an agent by CBiddulph

Discover

Podcast Features
Your all-in-one podcasting solution.

Blog to Podcast
Turn your blog into an engaging podcast.
Livestream
High-performing audio live, without limits.

Podcast Studio
Easy-to-use audio recorder app.
Podbean AI
AI-Enhanced Audio Quality and Content Generation.

Podcast App
The best podcast player & podcast app.

Ads Marketplace
Join Ads Marketplace to earn money
through sponsorship on your podcast.

PodAds
Manage your ads with dynamic ad insertion capability.
Apple Podcasts Subscriptions Integration
Effortlessly publish and manage exclusive episodes for your
Apple Podcasts subscribers directly from Podbean.
Live Streaming
Receive livestream rewards from your audience and earn
recurring income from your Fan Club membership.

All Arts Business Comedy Education
Fiction Government Health & Fitness History Kids & Family
Leisure Music News Religion & Spirituality Science
Society & Culture Sports Technology True Crime TV & Film
Live

How to Start a Podcast
How to Start a Live Podcast
How to Monetize a podcast
How to Promote Your Podcast
How to Use Group Recording

Log in
Start your podcast for free

Podcasting
Monetization
Advertisers
Enterprise
Pricing
Discover

The Nonlinear Library: LessWrong

Education

LW - OpenAI's Sora is an agent by CBiddulph

2024-02-16

Download Right click and do "save link as"

Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: OpenAI's Sora is an agent, published by CBiddulph on February 16, 2024 on LessWrong.
If you haven't already, take a look at Sora, OpenAI's new text-to-video AI. Sora can create scarily-realistic videos of nearly any subject. Unlike previous state-of-the-art AIs, the videos are coherent across time scales as long as one minute, and they can be much more complex.
Looking through OpenAI's research report, this one section caught my attention:
For a moment, I was confused: "what does it mean, Sora can 'control the player in Minecraft with a basic policy?' It's generating footage of a video game, not actually playing it... right?"
It's true that in these particular demo videos, Sora is "controlling the player" in its own internal model, rather than interfacing with Minecraft itself. However, I believe OpenAI is hinting that Sora can open the door to a much broader set of applications than just generating video.
In this post, I'll sketch an outline of how Sora could be used as an agent that plays any video game. With a bit of "visual prompt engineering," I believe this would even be possible with zero modifications to the base model. You could easily improve the model's efficiency and reliability by fine-tuning it and adding extra types of tokens, but I'll refrain from writing about that here.
The capabilities I'm predicting here aren't totally novel - OpenAI itself actually trained an AI to do tasks in Minecraft, very similarly to what I'll describe here.
What interests me is that Sora will likely be able to do many general tasks without much or any specialized training. In much the same way that GPT-3 learned all kinds of unexpected emergent capabilities just by learning to "predict the next token," Sora's ability to accurately "predict the next frame" could let it perform many visual tasks that depend on long-term reasoning.
Sorry if this reads like an "advancing capabilities" kind of post. Based on some of the wording throughout their research report, I believe OpenAI is already well aware of this, and it would be better for people to understand the implications of Sora sooner rather than later.
How to play any video game by predicting the next frame
Recall from the OpenAI report that Sora can take any video clip as input and predict how it will continue. To start it off, let's give it a one-second clip from the real Minecraft video game, showing the player character shuffling around a bit. At the bottom of that video, we'll add a virtual keyboard and mouse to the screen. The keys and buttons will turn black whenever the player presses them, and an arrow will indicate the mouse's current velocity:
If we ask Sora to continue the video with a short clip, it'll keep making the player character move around. Hopefully, it'll also change the display to reflect the actions the player is making - for instance, the left mouse button should turn black whenever the player interacts with an object. Video game streamers sometimes play with virtual keyboards on their screen, so I don't think it would be a huge logical leap for Sora to be able to accurately highlight the right keys.
This is how we can let Sora take "actions." Suppose that right after recording that one-second clip, we stop the game and wait for Sora to predict the next 0.1 seconds of the video. Once we have our results, we just take the average color of each key in the last frame of the predicted video and determine which buttons Sora thinks the player will be pressing.
Finally, we continue the game for 0.1 seconds, holding down those buttons, and feed the 1.1 seconds of real Minecraft video into Sora to get its next move.
Now Sora is moving around, doing some things that would be pretty reasonable for a human player to do. To give it some direction, let's add the text prompt "building a house." This will make Sora t...