Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Sora What, published by Zvi on February 23, 2024 on LessWrong.
Hours after Google announced Gemini 1.5, OpenAI announced their new video generation model Sora. Its outputs look damn impressive.
How Sora Works
How does it work? There is a technical report. Mostly it seems like OpenAI did standard OpenAI things, meaning they fed in tons of data, used lots of compute, and pressed the scaling button super hard. The innovations they are willing to talk about seem to be things like 'do not crop the videos into a standard size.'
That does not mean there are no other important innovations. I presume that there are. They simply are not talking about them.
We should not underestimate the value of throwing in massively more compute and getting a lot of the fiddly details right. That has been the formula for some time now.
Some people think that OpenAI was using a game engine to teach the model movement. Sherjil Ozair points out that this is silly, that movement is learned easily. The less silly speculation is that game engine outputs may have been in the training data. Jim Fan thinks this is likely the case, and calls the result a 'data-driven physics engine.' Raphaël Millière thinks this is likely, but more research is needed.
Brett Goldstein here digs into what it means that Sora works via 'patches' that combine to form the requested scene.
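The technical report describes carving videos into 'spacetime patches' that play the role tokens play in a language model: each patch covers a small block of pixels across a few frames, and the model operates on the resulting sequence. A minimal sketch of that representation, with illustrative patch sizes that are not Sora's actual values:

```python
import numpy as np

def to_spacetime_patches(video, pt=4, ph=16, pw=16):
    """Split a video array (frames, height, width, channels) into
    flattened spacetime patches, one token per (pt x ph x pw) block.
    Patch dimensions here are hypothetical, for illustration only."""
    t, h, w, c = video.shape
    assert t % pt == 0 and h % ph == 0 and w % pw == 0
    # Reshape into a grid of blocks, then flatten each block into a vector.
    patches = (video
               .reshape(t // pt, pt, h // ph, ph, w // pw, pw, c)
               .transpose(0, 2, 4, 1, 3, 5, 6)
               .reshape(-1, pt * ph * pw * c))
    return patches

# A 16-frame 64x64 RGB clip becomes a sequence of 64 patch tokens,
# each a vector of 4 * 16 * 16 * 3 = 3072 values.
video = np.random.rand(16, 64, 64, 3)
tokens = to_spacetime_patches(video)
print(tokens.shape)  # (64, 3072)
```

One appeal of this scheme: clips of different durations, resolutions, and aspect ratios simply yield sequences of different lengths, which is plausibly what lets the training avoid cropping everything to a standard size.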
Gary Marcus keeps noting how the model gets physics wrong in various places, and, well, yes, we all know, please cut it out with the Stop Having Fun.
Yishan points out that humans also run mostly on 'folk physics.' Most of the time humans are not 'doing logic'; they are vibing and using heuristics. I presume our dreams, if mapped to videos, would if anything look far less realistic than Sora.
Yann LeCun, who only a few days earlier said that video like Sora produces was not something we knew how to do, doubled down after the launch to say that none of this means the models 'understand the physical world,' and of course his approach is better because it does. Why update? Is all of this technically impressive?
Sora Is Technically Impressive
Yes, Sora is definitely technically impressive.
It was not, however, unexpected.
Sam Altman: we'd like to show you what sora can do, please reply with captions for videos you'd like to see and we'll start making some!
Eliezer Yudkowsky: 6 months left on this timer.
Eliezer Yudkowsky (August 26, 2022): In 2-4 years, if we're still alive, anytime you see a video this beautiful, your first thought will be to wonder whether it's real or if the AI's prompt was "beautiful video of 15 different moth species flapping their wings, professional photography, 8k, trending on Twitter".
Roko (other thread): I don't really understand why anyone is freaking out over Sora.
This is entirely to be expected given the existence of generative image models plus incrementally more hardware and engineering effort.
It's also obviously not dangerous (in a "take over the world" sense).
Eliezer Yudkowsky: This is of course my own take (what with having explicitly predicted this). But I do think you want to hold out a space for others to say, "Well *I* didn't predict it, and now I've updated."
Altman's account spent much of last Thursday making videos for people's requests, although not so many that they couldn't cherry-pick the good ones.
As usual, there are failures that look stupid, mistakes 'a person would never make' and all that. And there are flashes of absolute brilliance.
How impressive? There are disputes.
Tom Warren: this could be the "holy shit" moment of AI. OpenAI has just announced Sora, its text-to-video AI model. This video isn't real, it's based on a prompt of "a cat waking up its sleeping owner demanding breakfast…"
Daniel Eth: This isn't impressive. The owner doesn't wake up, so the AI clearly didn't understand…