With Sora's announcement, could this be the "iPhone" moment for Generative Video AI?
We surveyed the internet for people's reactions to Sora. At first blush, there are no surprises: most are simply amazed by the technology. But it doesn't stop there. We'll share the diverse opinions, both positive and negative, that we found on Sora's capabilities, implications, and future impact, and what it all means for digital publishers.
What is Sora?
Sora (meaning "sky" in Japanese) is a state-of-the-art text-to-video model that can generate high-quality, high-fidelity 1-minute videos with different aspect ratios and resolutions.
By combining a diffusion model (as in DALL-E 3) with a transformer architecture (as in GPT), Sora processes videos (temporal sequences of image frames) much as ChatGPT processes text.
A diffusion model starts with random noise and iterates toward a "clean" image that fits an input prompt. A video is then generated from a sequence of such images. However, maintaining coherence and consistency between frames is difficult. Sora addresses this by using a transformer architecture to model how frames relate to one another.
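The denoising idea can be sketched in a few lines. This toy loop is not Sora's actual sampler: the `toy_denoiser` function is a stand-in for a trained neural network, and the "prompt" is reduced to a fixed target image rather than a text embedding.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(x, target):
    """Stand-in for a learned network: nudges the sample toward the
    target, the way a trained denoiser would for a given prompt."""
    return x + 0.1 * (target - x)

# The "prompt" here is just a fixed target image; real diffusion models
# condition a neural network on text embeddings instead.
target = np.ones((8, 8))        # the "clean" image we want
x = rng.normal(size=(8, 8))     # start from pure noise

for step in range(50):          # iterate toward a clean sample
    x = toy_denoiser(x, target)

print(np.abs(x - target).max() < 0.1)  # True: the noise has been removed
```

Each iteration shrinks the remaining noise by a constant factor here; a real model instead predicts the noise to remove at each step, which is what makes the output match an arbitrary prompt rather than a fixed target.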
Drawing inspiration from DeepMind's work on Vision Transformers, Sora represents videos and images as collections of smaller units of data called "spacetime patches," each akin to a token in GPT.
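The patch idea itself is easy to illustrate. Below is a minimal NumPy sketch (not OpenAI's code; the patch dimensions are arbitrary assumptions) that cuts a video tensor into flattened spacetime patches, each playing the role of a token:

```python
import numpy as np

def spacetime_patches(video, pt=2, ph=16, pw=16):
    """Cut a video of shape (frames, height, width, channels) into
    spacetime patches, each covering pt frames and a ph x pw pixel
    region. Returns an array of shape (num_patches, pt*ph*pw*channels)."""
    T, H, W, C = video.shape
    # Truncate so each dimension divides evenly (a simplification;
    # real models pad or resize instead).
    T, H, W = T - T % pt, H - H % ph, W - W % pw
    v = video[:T, :H, :W]
    # Reshape into a grid of patches, then flatten each patch into a vector.
    v = v.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)
    return v.reshape(-1, pt * ph * pw * C)

# A 16-frame, 64x64 RGB clip becomes a sequence of patch "tokens".
clip = np.random.rand(16, 64, 64, 3)
tokens = spacetime_patches(clip)
print(tokens.shape)  # (128, 1536): 8*4*4 patches, each 2*16*16*3 values
```

Because any video, whatever its length or aspect ratio, reduces to a variable-length sequence of such patches, a transformer can consume it the same way GPT consumes a sequence of text tokens.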
Here is a high-level visualization of the model:
Source: OpenAI's technical report
Judging by the videos OpenAI has shared, Sora's quality of motion, understanding of object physics and interactions, and grasp of object permanence appear to surpass everything else in the field, including public models like SVD 1.1 and those from Pika Labs and Runway.
The Technical Marvel of Sora
Why is Sora a big deal?
Sora performs its work by manipulating pixels and conceptualizing three-dimensional scenes that unfold in time. In our heads, we probably do something similar - when we picture scenes and places in our mind's eye, we don't just imagine how they look but also know what they are.
Sora is generalized and scalable.
Sora can create multiple shots within a single generated video that accurately persist characters and visual style. It can sample widescreen 1920x1080 videos, vertical 1080x1920 videos, and everything in between:
Source: Tim Brooks
And because the model is based on a transformer architecture, it scales with available compute:
Source: Tsarathustra
Considering how far AI has progressed since this video of Will Smith eating spaghetti (released less than a year ago), models like Sora could, within a few months, be creating complex, Hollywood-style, multi-scene, multi-character videos of 5 or 10 minutes.
Sora is a learnable simulator, a "world model."
Sora understands not only style, scenery, character, objects, and concepts in the prompt but also how these things exist in the physical world. Its simulation capabilities include:
- 3D consistency
- Long-range coherence and object permanence (even when they leave the frame)
- Object interactions (e.g., new strokes on a canvas will persist over time)
- Simulating digital worlds (this Minecraft world, for example)
In a way, Sora is a "data-driven physics engine" that can "learn" a physics engine implicitly in the neural parameters by gradient descent through massive amounts of videos.
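As a toy illustration of "learning a physics engine by gradient descent," the sketch below recovers gravitational acceleration purely from observed positions of a falling object. It is a drastically simplified stand-in for what Sora does at scale with raw pixels: no physics is coded in, only a parameter fitted to observations.

```python
import numpy as np

# "Video-like" observations: heights of a ball in free fall, sampled at
# discrete timesteps (g = 9.8 is the ground truth we hope to recover).
dt, g = 0.1, 9.8
t = np.arange(50) * dt
y = 100.0 - 0.5 * g * t**2           # observed heights

# Model: y_hat(t) = 100 - 0.5 * a * t^2, with 'a' the learned parameter.
a = 0.0                              # start knowing nothing about physics
lr = 0.01
for _ in range(2000):                # gradient descent on squared error
    y_hat = 100.0 - 0.5 * a * t**2
    grad = np.mean((y_hat - y) * (-0.5 * t**2))  # d(error)/da
    a -= lr * grad

print(round(a, 2))  # 9.8: gravity recovered purely from observations
```

Here a single scalar is learned from noiseless data; the claim about Sora is that billions of neural parameters, trained the same way on massive video corpora, implicitly absorb far richer dynamics.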
It can produce complex simulations like this with an astounding rendition of water and light. Video games took several decades to get here, and they could only do it because of the recent advancements in ray tracing.
In other words, think of Sora as an intelligent computer program that (without prior knowledge) watches lots of videos to learn how things work in the real world, improves with practice, and delivers a high level of video realism that games took many decades to master.
The Types of Reactions
The internet's reaction to Sora is mixed. There are optimists, skeptics, and critics, all speculating on the implications of this seemingly rapid advancement in video AI.
The AI Optimists
In the tech and AI communities, the tone is predominantly positive, with users expressing amazement and enthusiasm about the model's capability to revolutionize industries like film, art, and video games.
The AI Critics
Creative artists, however, share the opposite sentiment - a mix of awe, fear, and contemplation. There's a sense of inevitability that the model will significantly alter or even replace the need for human creativity. Some artists expressed hope or belief that AI cannot fully replicate uniquely human creativity and emotion.
The AI Skeptics
In general, there is also some skepticism arising from concerns that OpenAI may have cherry-picked its examples. Users are taking a cautious approach, waiting to see how well the model generalizes and how reliable it is when faced with specific, detailed prompts.
Will Sora's performance match expectations?
While Sora's release videos have wowed the masses, much remains unknown about the model's inner workings and the training data OpenAI used. Some speculate that the training set includes synthetic data generated with a game engine alongside real-world video.
OpenAI wants people to believe that Sora is a step towards building general-purpose simulations of the physical world. Critics like Gary Marcus have questioned whether such models could represent physics accurately.
Marcus points to the violation of spatiotemporal continuity and failure of object permanence in several clips - flaws that he attributes to how the system reconstructs reality, not to errors in the training data. These glitches are to Sora what hallucinations are to LLMs.
"Sora is not a solution to AI's longstanding woes with space, time, and causality," he concludes.
Yann LeCun, the chief AI scientist at Meta, feels Sora is "doomed to failure," specifically contesting OpenAI's claims that Sora will enable the building of general-purpose world simulators.
LeCun refers to an age-old debate in machine learning between generative and discriminative models and believes that the former approach, generating pixels "from explanatory latent variables," is inefficient and can't adequately deal with the complex predictions in 3D space.
Jim Fan, senior AI research scientist at graphics firm Nvidia, is convinced Sora is a much bigger deal and is bullish on the model's capability to simulate physics. He believes more modalities and conditioning could make Sora a "full data-driven UE [Unreal Engine] that will replace all the hand-engineered graphics pipelines."
Sora is not the first text-to-video model. But it is a vast improvement over earlier models like Emu by Meta, Gen-2 by Runway, Stable Video Diffusion by Stability AI, and, most recently, Lumiere by Google. None of them generated buzz comparable to Sora's.
It is more powerful, offers a higher resolution, can generate longer, multi-shot videos, and is reportedly capable of video-editing tasks such as creating videos from images or other videos, combining elements from different videos, and extending videos in time.
Forrester senior analyst Rowan Curran feels that, for most applications of AI-generated video, Sora's consistency and length represent "new opportunities for creatives to incorporate elements of AI-generated video into traditional content. They could even construct full-blown narrative videos using one or a few prompts."
Industry Perspectives on Sora
Positive Outlooks
Sora could profoundly impact two broad areas. The first is marketing, entertainment, and digital media, and the second is learning and development.
In marketing, Sora could do what ChatGPT did to textual content creation. It could be used to create highly engaging and visually appealing videos. It could help brands stand out through greater creativity and personalization.
Taking note of the widespread "wow" feeling Sora has generated among marketers and content creators, Don Anderson, CEO of Kaddadle Consultancy, says, "Sora's wider application in small-to-medium sized business marketing campaigns and video ads will further democratize and drive down the costs of video creation."
In education and training, Sora could be used to develop videos tailored to specific topics or scenarios - making complex information more accessible, easier to understand, and engaging.
Michael Horn, Co-Founder of the Clayton Christensen Institute, sees Sora not just as a tool but as a gateway to democratizing education. "You can have different background knowledge and experiences, and now you can find the right video for you, dramatically lowering the barriers to learning anything," he says.
Sectors such as e-commerce also hold promising potential for future use cases. Retailers could create dynamic product demonstrations to showcase products more engagingly and interactively.
Sora is, without a doubt, a foundational model with advanced capabilities. However, its journey from inception to widespread adoption will require tailoring to specific industry use cases. Techniques such as fine-tuning, transfer learning, prompt engineering, embedded customization, and interfacing could further enhance particular use cases.
In digital publishing, for example, Sora's capabilities could be leveraged by services like Aeon to enable publishers to generate high-quality video at scale while retaining their journalistic voice - something difficult to accomplish when working with a native model alone.
Negative Outlooks
Sora's impressive capabilities arrive at a rather dangerous time. 2024 is a bumper election year: more elections will be held in the next 300 days than in any previous year.
Left unregulated, Sora could be used to deceive, create harmful deepfakes, generate and spread disinformation, or sow chaos.
Some have expressed concerns that technologies like ChatGPT, DALL-E, and now Sora could be positioned as more efficient writers, visual artists, and filmmakers, potentially swallowing up creative arts that empower people to express and feel human.
Sora has sent ripples across Hollywood. Actor, filmmaker, and studio owner Tyler Perry put his $800M studio expansion plans on hold after witnessing Sora's capabilities.
OpenAI's secretive development, rapid commercialization, corporate lobbying, pursuit of market dominance, and reported $7 trillion chip-funding ambitions have sparked debate. The company also stands accused of abandoning the shared, community-oriented approach to AI research that has existed for decades.
Its employees are reportedly asked to work in secrecy and to stop publishing research papers, upsetting the symbiotic relationship between the pursuit of profit and scientific discovery.
While the creators of Sora and those leveraging its capabilities stand to gain money and power, it is hard to estimate Sora's broader benefits to society. Regulators and users must carefully consider critical factors that could pose challenges, including copyright issues, ethical concerns, and the consequences of increased digital noise.
Would Sora be a net positive to humanity? Only time can tell.
Professional Insights
Sora may not be generally available yet, but the knowledge of its existence has already sent ripples across professional communities.
High-end videographers and animators believe that Sora can ease their workflow, but they are unsure how to "communicate" with the model through text alone. Usually, effects and details are fine-tuned using advanced software that allows control over many more parameters.
Image editors note that Sora-like tools are far more powerful, yet manipulate reality through a black-box approach that gives humans little control and requires little skill. This contrasts with tools like Photoshop, which take far more effort to master.
Those in creative advertising feel that Sora will be a game changer and upend many production companies and agencies if they ignore it.
In education, universities are preparing to offer courses for jobs that don't yet exist, or that may be wholly transformed by the time today's students graduate.
Ethical and Societal Considerations
From an ethical standpoint, it's somewhat unsettling to consider that Sora's source material is pictures and ideas - the idea of cities like Tokyo, the concept of a family or a group of friends, and a "beautiful homemade video."
Sora isn't Photoshop - it contains knowledge about what it shows us. This raises questions about how a "synthetic" video generated by Sora could be used.
Tools like Sora can easily make things worse in a world already plagued by disinformation. Bad actors could use it to endanger public health measures, influence elections, or even burden the justice system with potential fake evidence.
Even beyond these concerns, there are questions of copyright and intellectual property. Generative AI tools require vast amounts of training data, and it remains unclear where Sora's comes from.
As AI-generated content becomes indistinguishable from reality, it will prompt a reassessment of what we consider "real" and how we validate the truthfulness of visual media.
Managing trust in an age of synthetic videos will be challenging. People may question the authenticity of every piece of footage, which may change how information in video form is consumed and believed.
While executives and lawmakers agree that new AI systems must be regulated, there is little clarity on how that could be achieved.
As seen with DALL-E 2, it is never a single model that impacts society as a whole, but the wave of models it inspires. The future depends not just on what Sora can do, but on what its competitors and imitators will be able to do.
The Future of Sora
From computer games to social media, the path of evolution has been the same: text to images to video. Industry experts had predicted that generative AI would follow this path, but the timeline arrived far sooner than they expected.
When DALL-E 2 was released in April 2022, several industry experts claimed AI-generated video was still years away. Stable Diffusion followed only months later.
Sora is still a research preview; its costs are unknown but almost certainly substantial. As with computing generally, however, they are bound to come down over time. Technologies like custom AI chips and edge AI could expand its reach and use cases further.
AI was expected to add trillions of dollars in value even without Sora. Game-changing applications will emerge as professionals across industries get their hands on Sora. Advanced content creation, real-time video editing, personalized entertainment, education: the possibilities are endless.
In a not-so-distant future, video generators like Sora may be capable of simulating the physical and digital world and the objects, animals, and people that live within them. Future versions could even have scientific applications for physical, chemical, and societal experiments.
OpenAI has the power of compounding working in its favor: its earlier models help create more complex ones. Sora follows prompts well because, like DALL-E 3, it relies on synthetic captions describing the scenes in its training data, generated by another AI model such as GPT-4V.
And OpenAI will not stop here - it remains committed to achieving AGI.
What this means for digital publishers
For digital publishers, Sora is an enabling technology for the new age of video AI. At a time when the traditional broadcast, print, and digital news media industry is going through a challenging period, models like Sora will lower costs and ease the barriers to entry for video production.
However, we don't believe Sora alone will resolve every use case. While a tool like Sora excels at generating enhanced footage or establishing shots in lieu of stock, getting the greatest potential and output from these foundational creative tools requires an application layer that orchestrates the latest research and adapts to existing workflows. Will Sora reliably follow brand-compliant visual elements, or showcase text, diagrams, flow charts, and other assets common to marketers and publishers? Maybe, maybe not. In other words, Sora can help your business, but a smart application layer like Aeon will make life easier and help you get the most out of the latest AI research.
And so, there is room for a diverse toolkit to address the varied needs of video storytelling.
Recognizing the limitations of existing generative foundational tools, we are inspired to explore the creation of specialized solutions that can fill these gaps.
And this is how the market is forming today.
This endeavor is not just about harnessing the power of AI but also about ensuring that these technologies enhance, rather than dilute, brand essence and storytelling quality. Our vision is a streamlined, adaptive workflow that encompasses the entire video production process, from initial concept to final edit, ensuring that each piece of content resonates with its audience and stays true to the brand's voice.
Conclusion
Sora represents a significant leap in GenAI video, offering powerful capabilities to streamline video production and enhance storytelling.
As we look to the future, the potential for innovation in digital storytelling is boundless. The key to unlocking this potential lies in our ability to adapt, innovate, and integrate new technologies with a deep understanding of our audience's expectations and brand identity. By doing so, we can create content that not only captivates and engages but also stands out in a crowded digital landscape.
This journey into the future of digital storytelling is just beginning. As we continue to explore the possibilities and navigate the challenges, our goal remains steadfast: to tell compelling stories that connect, inspire, and endure.
If you are a publisher, discover how Aeon uses technologies like Sora to revolutionize your content discovery and boost your advertising and consumer revenues.
Contact us now to explore the possibilities!